Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context M Reid, N Savinov, D Teplyashin, D Lepikhin, T Lillicrap, J Alayrac, ... arXiv preprint arXiv:2403.05530, 2024 | 617 | 2024 |
On the utility of learning about humans for human-AI coordination M Carroll, R Shah, MK Ho, T Griffiths, S Seshia, P Abbeel, A Dragan Advances in Neural Information Processing Systems, 5174-5185, 2019 | 431 | 2019 |
Chlorophyll: Synthesis-aided compiler for low-power spatial architectures PM Phothilimthana, T Jelvis, R Shah, N Totla, S Chasins, R Bodik ACM SIGPLAN Notices 49 (6), 396-407, 2014 | 94 | 2014 |
Preferences Implicit in the State of the World R Shah, D Krasheninnikov, J Alexander, P Abbeel, A Dragan arXiv preprint arXiv:1902.04198, 2019 | 89* | 2019 |
Optimal Policies Tend to Seek Power AM Turner, L Smith, R Shah, A Critch, P Tadepalli arXiv preprint arXiv:1912.01683, 2019 | 87* | 2019 |
On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference R Shah, N Gundotra, P Abbeel, A Dragan International Conference on Machine Learning, 5670-5679, 2019 | 79 | 2019 |
Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals R Shah, V Varma, R Kumar, M Phuong, V Krakovna, J Uesato, Z Kenton arXiv preprint arXiv:2210.01790, 2022 | 67* | 2022 |
Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla T Lieberum, M Rahtz, J Kramár, N Nanda, G Irving, R Shah, V Mikulik arXiv preprint arXiv:2307.09458, 2023 | 54 | 2023 |
The MAGICAL Benchmark for Robust Imitation S Toyer, R Shah, A Critch, S Russell Advances in Neural Information Processing Systems 33, 2020 | 52 | 2020 |
Explaining grokking through circuit efficiency V Varma, R Shah, Z Kenton, J Kramár, R Kumar arXiv preprint arXiv:2309.02390, 2023 | 40 | 2023 |
Evaluating Frontier Models for Dangerous Capabilities M Phuong, M Aitchison, E Catt, S Cogan, A Kaskasoli, V Krakovna, ... arXiv preprint arXiv:2403.13793, 2024 | 35 | 2024 |
Evaluating the Robustness of Collaborative Agents P Knott, M Carroll, S Devlin, K Ciosek, K Hofmann, AD Dragan, R Shah arXiv preprint arXiv:2101.05507, 2021 | 31 | 2021 |
Benefits of Assistance over Reward Learning R Shah, P Freire, N Alex, R Freedman, D Krasheninnikov, L Chan, ... | 31 | 2020 |
Active Inverse Reward Design S Mindermann, R Shah, A Gleave, D Hadfield-Menell arXiv preprint arXiv:1809.03060, 2018 | 31 | 2018 |
An Empirical Investigation of Representation Learning for Imitation X Chen, S Toyer, C Wild, S Emmons, I Fischer, KH Lee, N Alex, SH Wang, ... Thirty-fifth Conference on Neural Information Processing Systems Datasets …, 2021 | 30 | 2021 |
Improving Dictionary Learning with Gated Sparse Autoencoders S Rajamanoharan, A Conmy, L Smith, T Lieberum, V Varma, J Kramár, ... arXiv preprint arXiv:2404.16014, 2024 | 26 | 2024 |
The MineRL BASALT Competition on Learning from Human Feedback R Shah, C Wild, SH Wang, N Alex, B Houghton, W Guss, S Mohanty, ... arXiv preprint arXiv:2107.01969, 2021 | 26 | 2021 |
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 T Lieberum, S Rajamanoharan, A Conmy, L Smith, N Sonnerat, V Varma, ... arXiv preprint arXiv:2408.05147, 2024 | 24 | 2024 |
AtP*: An efficient and scalable method for localizing LLM behaviour to components J Kramár, T Lieberum, R Shah, N Nanda arXiv preprint arXiv:2403.00745, 2024 | 18 | 2024 |
Choice Set Misspecification in Reward Inference R Freedman, R Shah, A Dragan CEUR Workshop Proceedings, 2020 | 18 | 2020 |