Projects Directory

Discover active AI safety research projects across academic institutions, independent organizations, and community-driven initiatives. Find collaboration opportunities and track research outputs.

Showing 9 of 16 projects

Active · Academic

Circuits-Based Neural Network Interpretability

Reverse-engineering neural networks by identifying and understanding individual circuits and their functions within transformer models.

Chris Olah
Anthropic
Interpretability, Mechanistic Interpretability, Deep Learning
1283
Active · Academic

Constitutional AI Development

Developing AI systems that follow a set of principles (a 'constitution') to guide their behavior without extensive human feedback on every output.

Yuntao Bai
Anthropic
Alignment, Constitutional AI, RLHF
531
Active · LessWrong

Agent Foundations Research Agenda

Theoretical research on the mathematical foundations of aligned AI agents, including decision theory, logical uncertainty, and embedded agency.

Eliezer Yudkowsky
MIRI
Agent Foundations, Decision Theory, Alignment
20502
Active · EA Forum

Global AI Governance Mapping Project

Comprehensive mapping of AI governance initiatives, policies, and actors worldwide to inform effective policy interventions.

Allan Dafoe
Centre for the Governance of AI, Oxford University
AI Governance, AI Policy, Forecasting
8151
Seeking Collaborators · Academic

Scalable Oversight Methods

Developing techniques for humans to effectively oversee AI systems even when the AI is performing tasks the human cannot directly evaluate.

Jan Leike
OpenAI
Scalable Oversight, Alignment, Evaluation
642
Active · Academic

Systematic Red Teaming for Large Language Models

Developing comprehensive methodologies for identifying vulnerabilities, harmful outputs, and failure modes in large language models.

Deep Ganguli
Anthropic
Red Teaming, Evaluation, Robustness
462
Active · LessWrong

AI Deception Detection Research

Investigating methods to detect when AI systems are being deceptive or strategically withholding information from users or overseers.

Evan Hubinger
Anthropic, MIRI
Deception Detection, Alignment, Interpretability
3121
Active · Independent

Eliciting Latent Knowledge (ELK)

Research program focused on getting AI systems to honestly report their internal knowledge, even when they might have incentives to be deceptive.

Paul Christiano
ARC
Alignment, Interpretability, Value Learning
281
Active · Academic

Cooperative AI Foundation Research

Studying how to build AI systems that can cooperate effectively with humans and other AI systems, including multi-agent coordination.

Gillian Hadfield
DeepMind, Oxford University
Agent Foundations, AI Governance, Value Learning
1050