Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback Y Bai, A Jones, K Ndousse, A Askell, A Chen, N DasSarma, D Drain, ... arXiv preprint arXiv:2204.05862, 2022 | 1452 | 2022 |
In-context learning and induction heads C Olsson, N Elhage, N Nanda, N Joseph, N DasSarma, T Henighan, ... Transformer Circuits Thread, 2022 | 536* | 2022 |
A Mathematical Framework for Transformer Circuits N Elhage, N Nanda, C Olsson, T Henighan, N Joseph, B Mann, A Askell, ... Transformer Circuits Thread, 2021 | 516* | 2021 |
Progress Measures For Grokking Via Mechanistic Interpretability N Nanda, L Chan, T Lieberum, J Smith, J Steinhardt ICLR 2023 Spotlight, 2023 | 312* | 2023 |
Predictability and surprise in large generative models D Ganguli, D Hernandez, L Lovitt, A Askell, Y Bai, A Chen, T Conerly, ... Proceedings of the 2022 ACM Conference on Fairness, Accountability, and …, 2022 | 275 | 2022 |
Finding Neurons in a Haystack: Case Studies with Sparse Probing W Gurnee, N Nanda, M Pauly, K Harvey, D Troitskii, D Bertsimas Transactions on Machine Learning Research, 2023 | 113 | 2023 |
Emergent Linear Representations in World Models of Self-Supervised Sequence Models N Nanda, A Lee, M Wattenberg BlackboxNLP at EMNLP 2023, Honourable Mention for Best Paper, 2023 | 104* | 2023 |
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations B Chughtai, L Chan, N Nanda ICML 2023, 2023 | 76 | 2023 |
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla T Lieberum, M Rahtz, J Kramár, N Nanda, G Irving, R Shah, V Mikulik arXiv preprint arXiv:2307.09458, 2023 | 64* | 2023 |
TransformerLens: A Library for Mechanistic Interpretability of Language Models N Nanda, J Bloom https://github.com/neelnanda-io/TransformerLens, 2022 | 60* | 2022 |
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods F Zhang, N Nanda ICLR 2024, 2023 | 55 | 2023 |
Linear Representations of Sentiment in Large Language Models C Tigges, OJ Hollinsworth, A Geiger, N Nanda BlackboxNLP 2024, 2023 | 40 | 2023 |
Refusal in Language Models Is Mediated by a Single Direction A Arditi, O Obeso, A Syed, D Paleka, N Rimsky, W Gurnee, N Nanda NeurIPS 2024, 2024 | 38* | 2024 |
Attribution Patching: Activation Patching At Industrial Scale N Nanda https://www.neelnanda.io/mechanistic-interpretability/attribution-patching, 2023 | 35 | 2023 |
Improving Dictionary Learning with Gated Sparse Autoencoders S Rajamanoharan, A Conmy, L Smith, T Lieberum, V Varma, J Kramár, ... NeurIPS 2024, 2024 | 31* | 2024 |
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 T Lieberum, S Rajamanoharan, A Conmy, L Smith, N Sonnerat, V Varma, ... BlackboxNLP 2024, 2024 | 24 | 2024 |
Copy suppression: Comprehensively understanding an attention head C McDougall, A Conmy, C Rushing, T McGrath, N Nanda BlackboxNLP 2024, 2023 | 23 | 2023 |
Universal Neurons in GPT2 Language Models W Gurnee, T Horsley, ZC Guo, TR Kheirkhah, Q Sun, W Hathaway, ... TMLR, 2024 | 21* | 2024 |
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level N Nanda, S Rajamanoharan, J Kramár, R Shah Alignment Forum, https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact …, 2023 | 21* | 2023 |
Neuron to graph: Interpreting language model neurons at scale A Foote, N Nanda, E Kran, I Konstas, S Cohen, F Barez arXiv preprint arXiv:2305.19911, 2023 | 21 | 2023 |