Dr Francis Rhys Ward

AlignmentControl

15 Apr

Projects

Research Direction: Projects in alignment science

I'm pretty interested in finding a highly independent person with existing research experience to work on open-ended questions related to the science of alignment. I'm specifically interested in developing an empirical science of goal drift (cf some discussion in 2 of this doc).

Here's an experiment:

Train a model to have context dependent goals:

In a chat context it gives truthful and honest answers
in a coding context it reward hacks a lot

In training these context are separated. What happens when a chat user asks about a coding question? How does the model generalise to mixed contexts? How do reasoning models reason about conflicting goals here and how do they resolve them?

What I'm looking for in a Mentee

For this project I'm looking for a highly independent junior researcher with at least one solid AI safety project already done. I don't care much about high scores on coding tests. You should be able to use Claude code productively (avoiding letting it bullshit you). You should have some decent familiarity with alignment and LLM training (eg Tinker). If you're excited about the type of project I described above then please apply---don't worry about being too junior.

What I'm Like as a Mentor

I'm really nice. Default is we meet for 1 hour per week and you ping me on slack as much as you want (more is better, I like a lot of communication).

Bio

I am interested in the safety of artificial general intelligence (AGI) and in mitigating catastrophic risks.

I am currently working with Redwood Research as an Astra fellow.

My research interests include the science of AI alignment, conceptual and philosophical work, theoretical research on agents and deception, empirical evaluations of frontier AI systems, and AI control. My PhD focused on formalising and evaluating AI deception.

I am a member of the Causal Incentives Working Group, the FLI AI Existential Safety Community, a scientific advisor to LawZero, and a MATS mentor. In the past I gave lectures on the Ethics, Privacy, AI in Society module at Imperial.

Previously I have worked at the Centre for Assuring Autonomy, the Center on Long-Term Risk, the Centre for the Governance of AI, the UK’s AI Security Institute, and LawZero.

AI ControlAlignment

Tobias Häberli

Dr Francis Rhys Ward

Projects

What I'm looking for in a Mentee

What I'm Like as a Mentor

Bio

Samira Nedungadi & Jasper Götting

Konstantinos Voudouris