Neel Nanda
Mechanistic Interpretability Team Lead, Google DeepMind
Verified email at deepmind.com · Homepage
Title
Cited by
Year
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Y Bai, A Jones, K Ndousse, A Askell, A Chen, N DasSarma, D Drain, ...
arXiv preprint arXiv:2204.05862, 2022
Cited by 1452 · 2022
In-context learning and induction heads
C Olsson, N Elhage, N Nanda, N Joseph, N DasSarma, T Henighan, ...
Transformer Circuits Thread, 2022
Cited by 536* · 2022
A Mathematical Framework for Transformer Circuits
N Elhage, N Nanda, C Olsson, T Henighan, N Joseph, B Mann, A Askell, ...
Transformer Circuits Thread, 2021
Cited by 516* · 2021
Progress Measures for Grokking via Mechanistic Interpretability
N Nanda, L Chan, T Liberum, J Smith, J Steinhardt
ICLR 2023 Spotlight, 2023
Cited by 312* · 2023
Predictability and surprise in large generative models
D Ganguli, D Hernandez, L Lovitt, A Askell, Y Bai, A Chen, T Conerly, ...
Proceedings of the 2022 ACM Conference on Fairness, Accountability, and …, 2022
Cited by 275 · 2022
Finding Neurons in a Haystack: Case Studies with Sparse Probing
W Gurnee, N Nanda, M Pauly, K Harvey, D Troitskii, D Bertsimas
Transactions on Machine Learning Research, 2023
Cited by 113 · 2023
Emergent Linear Representations in World Models of Self-Supervised Sequence Models
N Nanda, A Lee, M Wattenberg
BlackboxNLP at EMNLP 2023, Honourable Mention for Best Paper, 2023
Cited by 104* · 2023
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations
B Chughtai, L Chan, N Nanda
ICML 2023, 2023
Cited by 76 · 2023
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
T Lieberum, M Rahtz, J Kramár, N Nanda, G Irving, R Shah, V Mikulik
arXiv preprint arXiv:2307.09458, 2023
Cited by 64* · 2023
TransformerLens: A Library for Mechanistic Interpretability of Language Models
N Nanda, J Bloom
https://github.com/neelnanda-io/TransformerLens, 2022
Cited by 60* · 2022
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
F Zhang, N Nanda
ICLR 2024, 2023
Cited by 55 · 2023
Linear Representations of Sentiment in Large Language Models
C Tigges, OJ Hollinsworth, A Geiger, N Nanda
BlackboxNLP 2024, 2023
Cited by 40 · 2023
Refusal in Language Models Is Mediated by a Single Direction
A Arditi, O Obeso, A Syed, D Paleka, N Rimsky, W Gurnee, N Nanda
NeurIPS 2024, 2024
Cited by 38* · 2024
Attribution Patching: Activation Patching At Industrial Scale
N Nanda
https://www.neelnanda.io/mechanistic-interpretability/attribution-patching, 2023
Cited by 35 · 2023
Improving Dictionary Learning with Gated Sparse Autoencoders
S Rajamanoharan, A Conmy, L Smith, T Lieberum, V Varma, J Kramár, ...
NeurIPS 2024, 2024
Cited by 31* · 2024
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
T Lieberum, S Rajamanoharan, A Conmy, L Smith, N Sonnerat, V Varma, ...
BlackboxNLP 2024, 2024
Cited by 24 · 2024
Copy suppression: Comprehensively understanding an attention head
C McDougall, A Conmy, C Rushing, T McGrath, N Nanda
BlackboxNLP 2024, 2023
Cited by 23 · 2023
Universal Neurons in GPT2 Language Models
W Gurnee, T Horsley, ZC Guo, TR Kheirkhah, Q Sun, W Hathaway, ...
TMLR, 2024
Cited by 21* · 2024
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level
N Nanda, S Rajamanoharan, J Kramár, R Shah
Alignment Forum, https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact …, 2023
Cited by 21* · 2023
Neuron to graph: Interpreting language model neurons at scale
A Foote, N Nanda, E Kran, I Konstas, S Cohen, F Barez
arXiv preprint arXiv:2305.19911, 2023
Cited by 21 · 2023
Articles 1–20