Joar Skalse
DPhil Student in Computer Science, Oxford University
Verified email at cs.ox.ac.uk
Title
Cited by
Year
Defining and characterizing reward gaming
J Skalse, N Howe, D Krasheninnikov, D Krueger
Advances in Neural Information Processing Systems 35, 9460-9471, 2022
Cited by 252 · 2022
Risks from learned optimization in advanced machine learning systems
E Hubinger, C van Merwijk, V Mikulik, J Skalse, S Garrabrant
arXiv preprint arXiv:1906.01820, 2019
Cited by 168 · 2019
Is SGD a Bayesian sampler? Well, almost
C Mingard, G Valle-Pérez, J Skalse, AA Louis
Journal of Machine Learning Research 22 (79), 1-64, 2021
Cited by 59 · 2021
Invariance in policy optimisation and partial identifiability in reward learning
JMV Skalse, M Farrugia-Roberts, S Russell, A Abate, A Gleave
International Conference on Machine Learning, 32033-32058, 2023
Cited by 52 · 2023
Towards guaranteed safe AI: A framework for ensuring robust and reliable AI systems
D Dalrymple, J Skalse, Y Bengio, S Russell, M Tegmark, S Seshia, ...
arXiv preprint arXiv:2405.06624, 2024
Cited by 44 · 2024
Neural networks are a priori biased towards boolean functions with low entropy
C Mingard, J Skalse, G Valle-Pérez, D Martínez-Rubio, V Mikulik, ...
arXiv preprint arXiv:1909.11522, 2019
Cited by 35 · 2019
Misspecification in inverse reinforcement learning
J Skalse, A Abate
Proceedings of the AAAI Conference on Artificial Intelligence 37 (12), 15136 …, 2023
Cited by 34 · 2023
Lexicographic multi-objective reinforcement learning
J Skalse, L Hammond, C Griffin, A Abate
arXiv preprint arXiv:2212.13769, 2022
Cited by 29 · 2022
Reinforcement learning in Newcomblike environments
J Bell, L Linsefors, C Oesterheld, J Skalse
Advances in Neural Information Processing Systems 34, 22146-22157, 2021
Cited by 18 · 2021
On the limitations of Markovian rewards to express multi-objective, risk-sensitive, and modal tasks
J Skalse, A Abate
Uncertainty in Artificial Intelligence, 1974-1984, 2023
Cited by 13 · 2023
Goodhart's Law in Reinforcement Learning
J Karwowski, O Hayman, X Bai, K Kiendlhofer, C Griffin, J Skalse
arXiv preprint arXiv:2310.09144, 2023
Cited by 11 · 2023
STARC: A general framework for quantifying differences between reward functions
J Skalse, L Farnik, SR Motwani, E Jenner, A Gleave, A Abate
arXiv preprint arXiv:2309.15257, 2023
Cited by 7 · 2023
Quantifying the sensitivity of inverse reinforcement learning to misspecification
J Skalse, A Abate
arXiv preprint arXiv:2403.06854, 2024
Cited by 5 · 2024
A general framework for reward function distances
E Jenner, JMV Skalse, A Gleave
NeurIPS ML Safety Workshop, 2022
Cited by 5 · 2022
The reward hypothesis is false
JMV Skalse, A Abate
Cited by 5 · 2022
On the expressivity of objective-specification formalisms in reinforcement learning
R Subramani, M Williams, M Heitmann, H Holm, C Griffin, J Skalse
arXiv preprint arXiv:2310.11840, 2023
Cited by 4 · 2023
All’s well that ends well: Avoiding side effects with distance-impact penalties
C Griffin, JMV Skalse, L Hammond, A Abate
NeurIPS ML Safety Workshop, 2022
Cited by 2 · 2022
A General Counterexample to Any Decision Theory and Some Responses
J Skalse
arXiv preprint arXiv:2101.00280, 2021
Cited by 2 · 2021
Safety Properties of Inductive Logic Programming
G Leech, N Schoots, J Skalse
SafeAI@AAAI, 2021
Cited by 2 · 2021