Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation | R Shah, Q Feuillade-Montixi, S Pour, A Tagade, S Casper, J Rando | arXiv preprint arXiv:2311.03348, 2023 | 114 | 2023 |
Frontier Models are Capable of In-context Scheming | A Meinke, B Schoen, J Scheurer, M Balesni, R Shah, M Hobbhahn | arXiv preprint arXiv:2412.04984, 2024 | 12 | 2024 |
Linearly Structured World Representations in Maze-Solving Transformers | M Ivanitskiy, AF Spies, T Räuker, G Corlouer, C Mathwin, L Quirke, ... | UniReps: the First Workshop on Unifying Representations in Neural Models, 2023 | 8* | 2023 |
Towards evaluations-based safety cases for AI scheming | M Balesni, M Hobbhahn, D Lindner, A Meinke, T Korbak, J Clymer, ... | arXiv preprint arXiv:2411.03336, 2024 | 7 | 2024 |
A Configurable Library for Generating and Manipulating Maze Datasets | MI Ivanitskiy, R Shah, AF Spies, T Räuker, D Valentine, C Rager, L Quirke, ... | arXiv preprint arXiv:2309.10498, 2023 | 4 | 2023 |