
Dr. Wayne Jones
Allen Institute for AI
LessWrong, Academic, EA Forum
Papers: 57
Posts: 28
h-index: 24
Research Topics
Mechanistic Interpretability, AI Risk, RLHF
About
Dr. Wayne Jones is a researcher specializing in mechanistic interpretability, AI risk, and RLHF. They have published 57 papers, in venues including NeurIPS, ICML, and ICLR, and are active in the LessWrong community.
Scaling Monosemanticity: Extracting Interpretable Features from Large Language Models
NeurIPS 2024, 156 citations
Co-authors: Neel Nanda, Jan Leike
Toward Understanding of Circuits in Transformers
ICML 2023, 243 citations
Co-authors: Neel Nanda
Activation Patching: A Causal Lens on Neural Networks
ICLR 2023, 189 citations
Co-authors: Paul Christiano