Adam Kaufman
Projects
Research Direction: High-stakes AI control techniques and evaluations
Category 1: Building better control settings/evals
This might look like extending existing benchmarks such as BashArena or LinuxArena in order to facilitate different types of evaluations
Or building new control settings from scratch (e.g. using datasets of existing vulnerabilities)
Or building auxiliary evaluations to study control-relevant properties of models (e.g. steganography, secret keeping, experiments similar to https://arxiv.org/abs/2412.12480) without doing full control evaluations
Category 2: Developing and/or red-teaming control techniques using existing settings
Techniques leveraging interrogation or consistency checks
Building tools for monitors / allowing monitors to interact with the environment
Training trusted monitors
Red-teaming CoT monitoring in control settings
This isn't an exhaustive list of possible projects. This post contains some more information and additional project proposals; some of them are now outdated, but others are still relevant/open problems: https://www.lesswrong.com/posts/RRxhzshdpneyTzKfq/recent-redwood-research-project-proposals.
What I'm looking for in a Mentee
In this day and age, SWE skills are much less important than:
A scientific mindset / experience with the scientific method (e.g. tracking which questions we're trying to answer and coming up with experiments to address them as efficiently as possible, having predictions for how results will come out and noticing confusion when surprising things happen)
Conscientiousness (e.g. not forgetting important details when performing experiments that involve lots of variables, having a good system for keeping track of past results and ideas)
A solid understanding of why it's scary to build powerful AIs, what makes alignment difficult, and the core ideas and methodology from the AI control agenda. (The last one is the easiest to get up to speed on, but more familiarity is better).
What I'm Like as a Mentor
I tend to make lots of suggestions for experiments to try when faced with a round of results. These should usually be treated as weak suggestions, and I trust mentees to decide which ones are worth pursuing unless I try to make a stronger bid for pushing in a specific direction. I'd be happy to provide at least one round of pair-programming/watching as you work in order to suggest strategies/tools you might be unaware of, but I'm generally hands-off when it comes to implementation details so long as I trust that you know what you're doing.
Bio
I am a member of technical staff at Redwood Research, focused on developing techniques (https://www.bashcontrol.com/) and evaluations (https://www.bash-arena.com/) for controlling untrusted AIs in high-stakes deployments. I joined Redwood full time after participating in summer 2024 MATS mentored by Buck Shlegeris. Before that, I was a physics student at Harvard and a member of the Harvard AI Safety Team (now AISST).
