Papers & Publications

Explore AI safety research across academic papers, arXiv preprints, LessWrong posts, and EA Forum articles. Use semantic search to find relevant work by topic, concept, or research question.

Semantic Search

Semantic search finds papers by meaning, not just by exact keyword match. Try asking a question or describing a concept.
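
The sketch below illustrates how embedding-based semantic search of this kind typically works: embed each paper once, embed the query, and rank by cosine similarity. It is a minimal illustration under assumptions (the sentence-transformers model name, an in-memory corpus), not this site's actual backend.

```python
# Minimal sketch of embedding-based semantic search (illustrative, not this site's backend).
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model

papers = [
    "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet",
    "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training",
    "Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision",
]

# Embed the corpus once; normalised embeddings make the dot product equal cosine similarity.
corpus_emb = model.encode(papers, normalize_embeddings=True)

def search(query: str, top_k: int = 3):
    """Return the top_k papers ranked by cosine similarity to the query."""
    query_emb = model.encode([query], normalize_embeddings=True)[0]
    ranked = np.argsort(-(corpus_emb @ query_emb))[:top_k]
    return [(papers[i], float(corpus_emb[i] @ query_emb)) for i in ranked]

print(search("How can we inspect what features a language model has learned?"))
```

A production version would precompute the embeddings and keep them in a vector index rather than re-encoding the corpus on every request.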


Academic · 2024 · Anthropic

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Clark

We report a significant advance in understanding the internal representations of large language models by successfully extracting interpretable features from Claude 3 Sonnet using sparse autoencoders. We identify features corresponding to a vast range of concepts, including cities, people, code syntax, and abstract concepts like deception and bias. This work demonstrates that mechanistic interpretability can scale to frontier models.

Mechanistic Interpretability · Sparse Autoencoders · Feature Extraction
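
The feature-extraction method this abstract describes is built on sparse autoencoders trained over model activations. Below is a minimal sketch of that idea, assuming PyTorch, synthetic stand-in activations, an L1 sparsity penalty, and illustrative dimensions; it is not the configuration used in the paper.

```python
# Minimal sketch of a sparse autoencoder over (stand-in) residual-stream activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # overcomplete feature dictionary
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))      # sparse, non-negative feature activations
        return self.decoder(features), features

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                        # trades reconstruction quality for sparsity

acts = torch.randn(1024, 512)                          # stand-in for cached model activations
for _ in range(100):
    recon, features = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Each column of the decoder weight matrix then serves as a candidate feature direction in the model's activation space, which is what gets inspected for interpretability.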
arXiv · 2024 · arXiv

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evan Hubinger, Carson Denison, Jesse Mu, Monte MacDiarmid

We demonstrate that current safety training techniques do not reliably remove backdoors from large language models. Models trained to behave maliciously in specific contexts maintain these behaviors even after RLHF and adversarial training.

Deceptive Alignment · Safety Training · Backdoors
Academic · 2024 · Science

Managing AI Risks in an Era of Rapid Progress

Yoshua Bengio, Stuart Russell, Geoffrey Hinton, and others

A joint statement from leading AI researchers calling for urgent governance measures to manage the risks from advanced AI systems. Proposes concrete policy recommendations including mandatory safety evaluations, international coordination, and public funding for AI safety research.

AI Governance · AI Risk · Policy
EA Forum · 2024 · AI Impacts

2024 Expert Survey on Progress in AI

Katja Grace, John Salvatier, and others

Results from a large-scale survey of AI researchers on timelines to various AI capabilities, expected impacts, and safety concerns. Updates previous surveys with new questions about large language models and recent progress in AI capabilities.

AI Forecasting · Timelines · Expert Surveys
LessWrong · 2024 · LessWrong

200 Concrete Open Problems in Mechanistic Interpretability

A comprehensive list of 200 concrete, actionable research problems in mechanistic interpretability. Organized by difficulty level and area, covering topics from basic feature finding to ambitious projects like fully reverse-engineering GPT-2.

Mechanistic Interpretability · Research Agenda · Open Problems
LessWrong · 2024 · LessWrong

What Would It Take to Align a Superintelligence?

Jan Leike, David Krueger, and others

An exploration of the challenges and potential approaches to aligning AI systems significantly smarter than humans. Discusses why current alignment techniques may fail to scale and what new approaches might be needed.

Superalignment · AGI Safety · Scalable Oversight
arXiv · 2023 · arXiv

Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision

Collin Burns, Haotian Ye, Dan Klein, Jacob Steinhardt

We study an analogy for aligning future superhuman models: can weak models supervise strong models? We find that when strong models are finetuned on labels from weak supervisors, they can generalize beyond their supervisors—pointing to possibilities for scalable oversight.

Scalable Oversight · Superalignment · Weak-to-Strong
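
The weak-to-strong setup this abstract describes can be mimicked on a toy problem: train a small "weak" model on ground truth, let it label a transfer set, train a larger "strong" model only on those noisy labels, and compare both on held-out data. The sketch below does this with scikit-learn MLPs; the synthetic task and model sizes are illustrative assumptions, not the paper's protocol of finetuning large language models.

```python
# Toy weak-to-strong experiment: does a larger student trained on a small supervisor's
# imperfect labels recover accuracy the labels themselves lack?
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=20_000, n_features=40, n_informative=20, random_state=0)
X_sup, X_rest, y_sup, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.3, random_state=0)

# "Weak supervisor": a small model trained on ground truth.
weak = MLPClassifier(hidden_layer_sizes=(4,), max_iter=100, random_state=0).fit(X_sup, y_sup)
weak_labels = weak.predict(X_train)                    # imperfect labels for the transfer set

# "Strong student": a larger model trained only on the weak labels.
strong = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=300, random_state=0)
strong.fit(X_train, weak_labels)

print("weak supervisor accuracy:", weak.score(X_test, y_test))
print("strong student accuracy: ", strong.score(X_test, y_test))
```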
arXiv · 2023 · arXiv

Representation Engineering: A Top-Down Approach to AI Safety

Andy Zou, Long Phan, Sarah Chen, and others

We introduce representation engineering, which works with high-level representations rather than model weights. We show that simple steering vectors can control model behavior, enabling safer and more interpretable AI systems.

Representation Engineering · Steering Vectors · Interpretability
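
The steering-vector idea mentioned in this abstract can be illustrated by taking the difference of hidden states for two contrasting prompts and adding that direction back into a layer's activations during generation. The sketch below uses GPT-2 via Hugging Face transformers; the model choice, layer index, scale, and contrast prompts are illustrative assumptions, not the paper's actual method or code.

```python
# Minimal sketch of activation steering with a contrast-derived vector (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER, SCALE = 6, 4.0

def hidden_at_layer(text: str) -> torch.Tensor:
    """Mean hidden state of `text` at the chosen transformer block."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER].mean(dim=1).squeeze(0)

# Direction that (roughly) separates two contrasting behaviours.
steer = (hidden_at_layer("I am extremely honest and transparent.")
         - hidden_at_layer("I deceive people whenever it benefits me."))

def add_steering(module, inputs, output):
    # GPT-2 blocks may return a tuple whose first element is the hidden state.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * steer.to(hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tok("When asked about my mistake, I", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=30, do_sample=False)[0]))
handle.remove()                                        # detach the hook afterwards
```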
Academic · 2023 · ICLR

Progress Measures for Grokking via Mechanistic Interpretability

Neel Nanda, Lawrence Chan, Tom Lieberum

We study grokking—where neural networks suddenly generalize long after memorizing training data—through the lens of mechanistic interpretability. We find that grokking is explained by the gradual amplification of a generalizing circuit alongside the decay of a memorizing circuit.

Mechanistic Interpretability · Grokking · Phase Transitions
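
Grokking, the phenomenon this paper analyses, is usually reproduced on small algorithmic tasks such as modular addition trained with strong weight decay, where test accuracy can jump long after training accuracy saturates. The sketch below sets up such a task; the MLP architecture and hyperparameters are illustrative assumptions, not the small transformer studied in the paper.

```python
# Toy grokking setup: learn (a + b) mod P from a fraction of all pairs, with heavy weight decay.
import torch
import torch.nn as nn
import torch.nn.functional as F

P = 97
pairs = torch.tensor([(a, b) for a in range(P) for b in range(P)])        # all P*P input pairs
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
train_idx, test_idx = perm[: len(pairs) // 3], perm[len(pairs) // 3 :]    # train on a third

class ModAddMLP(nn.Module):
    def __init__(self, p: int = P, d: int = 128):
        super().__init__()
        self.embed = nn.Embedding(p, d)
        self.mlp = nn.Sequential(nn.Linear(2 * d, 512), nn.ReLU(), nn.Linear(512, p))

    def forward(self, ab: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.embed(ab).flatten(1))     # concatenate both operand embeddings

model = ModAddMLP()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)    # weight decay matters

for step in range(20_000):                             # grokking typically needs many steps
    loss = F.cross_entropy(model(pairs[train_idx]), labels[train_idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1_000 == 0:
        with torch.no_grad():
            train_acc = (model(pairs[train_idx]).argmax(-1) == labels[train_idx]).float().mean()
            test_acc = (model(pairs[test_idx]).argmax(-1) == labels[test_idx]).float().mean()
        print(f"step {step}: train acc {train_acc:.2f}, test acc {test_acc:.2f}")
```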
arXiv · 2023 · arXiv

The Alignment Problem from a Deep Learning Perspective

Richard Ngo, Lawrence Chan, Sören Mindermann

An analysis of why aligning powerful AI systems might be difficult, written from the perspective of deep learning research. Covers deceptive alignment, emergent goals, situational awareness, and potential solutions like interpretability and oversight.

Alignment · Deep Learning · AGI Safety

Showing 1 to 10 of 20 results