Defining and characterizing reward gaming. J Skalse, N Howe, D Krasheninnikov, D Krueger. Advances in Neural Information Processing Systems 35, 9460-9471, 2022. Cited by 252.
Risks from learned optimization in advanced machine learning systems. E Hubinger, C van Merwijk, V Mikulik, J Skalse, S Garrabrant. arXiv preprint arXiv:1906.01820, 2019. Cited by 168.
Is SGD a Bayesian sampler? Well, almost. C Mingard, G Valle-Pérez, J Skalse, AA Louis. Journal of Machine Learning Research 22 (79), 1-64, 2021. Cited by 59.
Invariance in policy optimisation and partial identifiability in reward learning. JMV Skalse, M Farrugia-Roberts, S Russell, A Abate, A Gleave. International Conference on Machine Learning, 32033-32058, 2023. Cited by 52.
Towards guaranteed safe AI: A framework for ensuring robust and reliable AI systems. D Dalrymple, J Skalse, Y Bengio, S Russell, M Tegmark, S Seshia, et al. arXiv preprint arXiv:2405.06624, 2024. Cited by 44.
Neural networks are a priori biased towards Boolean functions with low entropy. C Mingard, J Skalse, G Valle-Pérez, D Martínez-Rubio, V Mikulik, et al. arXiv preprint arXiv:1909.11522, 2019. Cited by 35.
Misspecification in inverse reinforcement learning. J Skalse, A Abate. Proceedings of the AAAI Conference on Artificial Intelligence 37 (12), 15136 …, 2023. Cited by 34.
Lexicographic multi-objective reinforcement learning. J Skalse, L Hammond, C Griffin, A Abate. arXiv preprint arXiv:2212.13769, 2022. Cited by 29.
Reinforcement learning in Newcomblike environments. J Bell, L Linsefors, C Oesterheld, J Skalse. Advances in Neural Information Processing Systems 34, 22146-22157, 2021. Cited by 18.
On the limitations of Markovian rewards to express multi-objective, risk-sensitive, and modal tasks. J Skalse, A Abate. Uncertainty in Artificial Intelligence, 1974-1984, 2023. Cited by 13.
Goodhart's Law in Reinforcement Learning. J Karwowski, O Hayman, X Bai, K Kiendlhofer, C Griffin, J Skalse. arXiv preprint arXiv:2310.09144, 2023. Cited by 11.
STARC: A general framework for quantifying differences between reward functions. J Skalse, L Farnik, SR Motwani, E Jenner, A Gleave, A Abate. arXiv preprint arXiv:2309.15257, 2023. Cited by 7.
Quantifying the sensitivity of inverse reinforcement learning to misspecification. J Skalse, A Abate. arXiv preprint arXiv:2403.06854, 2024. Cited by 5.
A general framework for reward function distances. E Jenner, JMV Skalse, A Gleave. NeurIPS ML Safety Workshop, 2022. Cited by 5.
The reward hypothesis is false. JMV Skalse, A Abate. 2022. Cited by 5.
On the expressivity of objective-specification formalisms in reinforcement learning. R Subramani, M Williams, M Heitmann, H Holm, C Griffin, J Skalse. arXiv preprint arXiv:2310.11840, 2023. Cited by 4.
All’s well that ends well: Avoiding side effects with distance-impact penalties. C Griffin, JMV Skalse, L Hammond, A Abate. NeurIPS ML Safety Workshop, 2022. Cited by 2.
A General Counterexample to Any Decision Theory and Some Responses. J Skalse. arXiv preprint arXiv:2101.00280, 2021. Cited by 2.
Safety properties of inductive logic programming. G Leech, N Schoots, J Skalse. SafeAI@AAAI, 2021. Cited by 2.