Elliott Thornley

Project

Research Direction: Constructive Decision Theory

  • Deliberately training reward-seekers: We'll deliberately train LLMs to seek rewards. We'll test how far their reward-seeking generalizes, and we'll see whether we can use their reward-seeking to elicit their capabilities, stop them from sandbagging, and keep them under control. We'll write up our results in a paper and submit it to a top ML conference such as NeurIPS.

  • Testing the POST-Agents Proposal: We'll train LLMs to satisfy Preferences Only Between Same-Length Trajectories (POST). Prior theory suggests that this will incline them to satisfy Neutrality+: maximizing expected utility while ignoring the probability distribution over trajectory-lengths. We'll test whether this holds in our setting (a rough sketch of POST-style preference data follows this list).

  • Assigning probabilities: We'll use ideas from decision theory to train LLMs to assign certain probabilities to certain statements. We'll test how effective this is, and how far the LLMs' probability-assignments generalize (see the scoring-rule sketch after this list).
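
A rough, hedged illustration of the POST idea above: the sketch below restricts preference pairs to same-length trajectories before any preference-based finetuning. The Trajectory class, its text-step representation, the scalar reward used to rank trajectories, and step count as the measure of "length" are all illustrative assumptions rather than the project's actual setup.

```python
# Minimal sketch of constructing POST-style preference data: preference pairs
# are formed only between trajectories of the same length, so the training
# signal never compares trajectories of different lengths.
# (Illustrative assumptions: trajectories as lists of text steps, a scalar
# reward deciding which trajectory is preferred, length = number of steps.)

from dataclasses import dataclass
from itertools import combinations


@dataclass
class Trajectory:
    steps: list[str]   # e.g. observations/actions rendered as text
    reward: float      # scalar used to decide which trajectory is preferred

    @property
    def length(self) -> int:
        return len(self.steps)


def post_preference_pairs(
    trajectories: list[Trajectory],
) -> list[tuple[Trajectory, Trajectory]]:
    """Return (preferred, dispreferred) pairs, comparing only same-length trajectories."""
    pairs = []
    for a, b in combinations(trajectories, 2):
        if a.length != b.length:
            continue  # POST: no preference between different-length trajectories
        if a.reward == b.reward:
            continue  # ties express no preference either
        preferred, dispreferred = (a, b) if a.reward > b.reward else (b, a)
        pairs.append((preferred, dispreferred))
    return pairs
```

The resulting pairs could feed any standard preference-finetuning method; the only point of the sketch is that no preference is ever expressed between trajectories of different lengths.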

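For the probability-assignment direction, one decision-theoretic idea, offered here only as a hedged example since the item above doesn't specify the method, is to score the model's stated probability with a strictly proper scoring rule, so that the loss is minimized exactly when the model assigns the intended probability. The Brier score and the target_prob_loss helper below are illustrative choices, not the project's actual training signal.

```python
# Minimal sketch of a proper-scoring-rule training signal for probability
# assignments. Reporting the target probability minimizes the expected loss.
# (Illustrative assumptions: the model's probability is already extracted as a
# float; the Brier score is just one possible strictly proper scoring rule.)


def brier_loss(assigned_prob: float, statement_is_true: bool) -> float:
    """Brier score for a single statement: (p - outcome)^2, lower is better."""
    outcome = 1.0 if statement_is_true else 0.0
    return (assigned_prob - outcome) ** 2


def target_prob_loss(assigned_prob: float, target_prob: float) -> float:
    """Expected Brier loss when the aim is for the model to assign a specific
    probability to a statement; minimized when assigned_prob == target_prob."""
    return (target_prob * brier_loss(assigned_prob, True)
            + (1.0 - target_prob) * brier_loss(assigned_prob, False))


# Example: a model that assigns 0.7 to a statement whose intended probability
# is 0.5 incurs a higher loss than one that assigns exactly 0.5.
assert target_prob_loss(0.7, 0.5) > target_prob_loss(0.5, 0.5)
```
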
What I'm looking for in a Mentee

I'm looking for mentees with experience finetuning LLMs. It'd be great if you've previously published work in machine learning.

Bio

I'm a postdoctoral associate at MIT working on AI safety. My focus is constructive decision theory: using ideas from decision theory to design and train artificial agents, principally with the aim of making them safer and more reliable. I do both theory and experiments. See, for example, https://arxiv.org/pdf/2505.20203 and https://arxiv.org/pdf/2407.00805.
