Neel Nanda

Cited by

	All	Since 2019
Citations	3996	3992
h-index	20	20
i10-index	28	28

Citations per year: 2022: 106; 2023: 1046; 2024: 2818

Public access

1 article available, 0 articles not available (based on funding mandates)

Co-authors

Tom Lieberum, Google DeepMind (verified email at deepmind.com)
Catherine Olsson, Anthropic (verified email at mit.edu)
Lawrence Chan, PhD Student, UC Berkeley (verified email at berkeley.edu)
Christopher Olah, Anthropic (verified email at google.com)
Bilal Chughtai, Independent (verified email at cam.ac.uk)

Neel Nanda

Mechanistic Interpretability Team Lead, Google DeepMind

Verified email at deepmind.com

AI, ML, AI Alignment, Interpretability, Mechanistic Interpretability


Title	Authors	Publication	Cited by	Year
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback	Y Bai, A Jones, K Ndousse, A Askell, A Chen, N DasSarma, D Drain, ...	arXiv preprint arXiv:2204.05862, 2022	1452	2022
In-context learning and induction heads	C Olsson, N Elhage, N Nanda, N Joseph, N DasSarma, T Henighan, ...	Transformer Circuits Thread, 2022	536*	2022
A Mathematical Framework for Transformer Circuits	N Elhage, N Nanda, C Olsson, T Henighan, N Joseph, B Mann, A Askell, ...	Transformer Circuits Thread, 2021	516*	2021
Progress Measures For Grokking Via Mechanistic Interpretability	N Nanda, L Chan, T Lieberum, J Smith, J Steinhardt	ICLR 2023 Spotlight, 2023	312*	2023
Predictability and surprise in large generative models	D Ganguli, D Hernandez, L Lovitt, A Askell, Y Bai, A Chen, T Conerly, ...	Proceedings of the 2022 ACM Conference on Fairness, Accountability, and …, 2022	275	2022
Finding Neurons in a Haystack: Case Studies with Sparse Probing	W Gurnee, N Nanda, M Pauly, K Harvey, D Troitskii, D Bertsimas	Transactions on Machine Learning Research, 2023	113	2023
Emergent Linear Representations in World Models of Self-Supervised Sequence Models	N Nanda, A Lee, M Wattenberg	BlackboxNLP at EMNLP 2023, Honourable Mention for Best Paper, 2023	104*	2023
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations	B Chughtai, L Chan, N Nanda	ICML 2023, 2023	76	2023
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla	T Lieberum, M Rahtz, J Kramár, N Nanda, G Irving, R Shah, V Mikulik	arXiv preprint arXiv:2307.09458, 2023	64*	2023
TransformerLens: A Library for Mechanistic Interpretability of Language Models	N Nanda, J Bloom	https://github.com/neelnanda-io/TransformerLens, 2022	60*	2022
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods	F Zhang, N Nanda	ICLR 2024, 2023	55	2023
Linear Representations of Sentiment in Large Language Models	C Tigges, OJ Hollinsworth, A Geiger, N Nanda	BlackboxNLP 2024, 2023	40	2023
Refusal in Language Models Is Mediated by a Single Direction	A Arditi, O Obeso, A Syed, D Paleka, N Rimsky, W Gurnee, N Nanda	NeurIPS 2024, 2024	38*	2024
Attribution Patching: Activation Patching At Industrial Scale	N Nanda	https://www.neelnanda.io/mechanistic-interpretability/attribution-patching, 2023	35	2023
Improving Dictionary Learning with Gated Sparse Autoencoders	S Rajamanoharan, A Conmy, L Smith, T Lieberum, V Varma, J Kramár, ...	NeurIPS 2024, 2024	31*	2024
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2	T Lieberum, S Rajamanoharan, A Conmy, L Smith, N Sonnerat, V Varma, ...	BlackboxNLP 2024, 2024	24	2024
Copy suppression: Comprehensively understanding an attention head	C McDougall, A Conmy, C Rushing, T McGrath, N Nanda	BlackboxNLP 2024, 2023	23	2023
Universal Neurons in GPT2 Language Models	W Gurnee, T Horsley, ZC Guo, TR Kheirkhah, Q Sun, W Hathaway, ...	TMLR, 2024	21*	2024
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level	N Nanda, S Rajamanoharan, J Kramár, R Shah	Alignment Forum, https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact …, 2023	21*	2023
Neuron to graph: Interpreting language model neurons at scale	A Foote, N Nanda, E Kran, I Konstas, S Cohen, F Barez	arXiv preprint arXiv:2305.19911, 2023	21	2023
