Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
We report a significant advance in understanding the internal representations of large language models by successfully extracting interpretable features from Claude 3 Sonnet using sparse autoencoders. We identify features corresponding to a vast range of concepts, spanning cities, people, and code syntax, as well as abstract notions such as deception and bias. This work demonstrates that mechanistic interpretability can scale to frontier models.
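To make the method concrete, the sketch below shows the basic shape of a sparse autoencoder: activations are encoded into an overcomplete, non-negative feature basis and reconstructed as a sparse linear combination of learned feature directions, with an L1 penalty encouraging sparsity. This is a minimal illustration only; the dimensions, hyperparameters, and random stand-in data are assumptions, not the settings or activations used in the work itself.

```python
# Minimal sparse-autoencoder sketch. All sizes, coefficients, and the random
# input data are illustrative assumptions; real training uses a model's
# residual-stream activations, which are not reproduced here.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # Encoder maps activations into an overcomplete feature basis.
        self.encoder = nn.Linear(d_model, d_features)
        # Decoder reconstructs activations from the sparse feature activations.
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # non-negative feature activations
        x_hat = self.decoder(f)           # reconstruction of the input
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features
    # to zero on any given input, which is what makes them interpretable.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity

# One illustrative training step on random data standing in for activations.
d_model, d_features = 512, 4096
sae = SparseAutoencoder(d_model, d_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
x = torch.randn(64, d_model)
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
loss.backward()
opt.step()
```

The key design choice is that the feature dictionary is much wider than the model dimension, so each learned direction can specialize to a single concept rather than superposing many.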