Adam Kaufman

Control

15 Apr

Written By Tobias Häberli

Projects

Research Direction: High-stakes AI control techniques and evaluations

Category 1: Building better control settings/evals

This might look like extending existing benchmarks such as BashArena or LinuxArena in order to facilitate different types of evaluations
Or building new control settings from scratch (e.g. using datasets of existing vulnerabilities)
Or building auxiliary evaluations to study control-relevant properties of models (e.g. steganography, secret keeping, experiments similar to https://arxiv.org/abs/2412.12480) without doing full control evaluations

Category 2: Developing and/or red-teaming control techniques using existing settings

Techniques leveraging interrogation or consistency checks
Building tools for monitors / allowing monitors to interact with the environment
Training trusted monitors
Red-teaming CoT monitoring in control settings

This isn't an exhaustive list of possible projects. This post contains some more information and additional project proposals; some of them are now outdated, but others are still relevant/open problems: https://www.lesswrong.com/posts/RRxhzshdpneyTzKfq/recent-redwood-research-project-proposals.

What I'm looking for in a Mentee

In this day and age, SWE skills are much less important than:

A scientific mindset / experience with the scientific method (e.g. tracking which questions we're trying to answer and coming up with experiments to address them as efficiently as possible, having predictions for how results will come out and noticing confusion when surprising things happen)
Conscientiousness (e.g. not forgetting important details when performing experiments that involve lots of variables, having a good system for keeping track of past results and ideas)
A solid understanding of why it's scary to build powerful AIs, what makes alignment difficult, and the core ideas and methodology from the AI control agenda. (The last one is the easiest to get up to speed on, but more familiarity is better).

What I'm Like as a Mentor

I tend to make lots of suggestions for experiments to try when faced with a round of results. These should usually be treated as weak suggestions, and I trust mentees to decide which ones are worth pursuing unless I try to make a stronger bid for pushing in a specific direction. I'd be happy to provide at least one round of pair-programming/watching as you work in order to suggest strategies/tools you might be unaware of, but I'm generally hands-off when it comes to implementation details so long as I trust that you know what you're doing.

Bio

I am a member of technical staff at Redwood Research, focused on developing techniques (https://www.bashcontrol.com/) and evaluations (https://www.bash-arena.com/) for controlling untrusted AIs in high-stakes deployments. I joined Redwood full time after participating in summer 2024 MATS mentored by Buck Shlegeris. Before that, I was a physics student at Harvard and a member of the Harvard AI Safety Team (now AISST).

AI Control

Tobias Häberli

Adam Kaufman

Projects

What I'm looking for in a Mentee

What I'm Like as a Mentor

Bio

Damiano Fornasiere & Gaël Gendron

Cozmin Ududec