Dr. Wayne Jones

Allen Institute for AI

LessWrong, Academic, EA Forum
Papers: 57
Posts: 28
h-index: 24
Research Topics
Mechanistic Interpretability, AI Risk, RLHF
About

Dr. Wayne Jones is a researcher specializing in mechanistic interpretability, AI risk, and RLHF. They have published at venues including NeurIPS, ICML, and ICLR and are active in the LessWrong community.

Scaling Monosemanticity: Extracting Interpretable Features from Large Language Models

NeurIPS 2024, 156 citations

Co-authors: Neel Nanda, Jan Leike

Toward Understanding of Circuits in Transformers

ICML 2023, 243 citations

Co-authors: Neel Nanda

Activation Patching: A Causal Lens on Neural Networks

ICLR 2023, 189 citations

Co-authors: Paul Christiano