Rusheb Shah
Apollo Research
Verified email at apolloresearch.ai
Title
Cited by
Year
Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
R Shah, Q Feuillade-Montixi, S Pour, A Tagade, S Casper, J Rando
arXiv preprint arXiv:2311.03348, 2023
Cited by: 114 · Year: 2023
Frontier Models are Capable of In-context Scheming
A Meinke, B Schoen, J Scheurer, M Balesni, R Shah, M Hobbhahn
arXiv preprint arXiv:2412.04984, 2024
Cited by: 12 · Year: 2024
Linearly Structured World Representations in Maze-Solving Transformers
M Ivanitskiy, AF Spies, T Räuker, G Corlouer, C Mathwin, L Quirke, ...
UniReps: the First Workshop on Unifying Representations in Neural Models, 2023
Cited by: 8* · Year: 2023
Towards evaluations-based safety cases for AI scheming
M Balesni, M Hobbhahn, D Lindner, A Meinke, T Korbak, J Clymer, ...
arXiv preprint arXiv:2411.03336, 2024
Cited by: 7 · Year: 2024
A Configurable Library for Generating and Manipulating Maze Datasets
MI Ivanitskiy, R Shah, AF Spies, T Räuker, D Valentine, C Rager, L Quirke, ...
arXiv preprint arXiv:2309.10498, 2023
Cited by: 4 · Year: 2023
Articles 1–5