Dylan Hadfield-Menell

Projects

Research Direction: Moving beyond the post-training frame for alignment: interpretability, in-context alignment, and institutions

  • Developing and applying sparse autoencoder (SAE)-based interpretability tools to characterize and steer model behaviors.

  • Creating tools to facilitate intuitive specification of custom behaviors and preferences in AI models.

  • Exploring "in-context alignment" methods that enable open-ended agents to track and reason about human preferences and uncertainty (e.g., Open-Universe Assistance Games).

  • Designing democratic institutions to collectively specify, audit, and enforce societal alignment standards for deployed AI systems.

What I'm looking for in a mentee

Ideally, you're excited to deeply explore foundational alignment questions and to iterate quickly on experimental approaches. I value collaboration that's reflective, open-ended, and ambitious, and I prefer mentees who are comfortable navigating uncertain and interdisciplinary topics. I'm looking for researchers who are interested in AI safety and have already reached a baseline level of research maturity: ideally, you've seen 2-3 papers go through the submission pipeline and been involved in at least one paper from start to finish.

As a mentor, I have two primary goals: first, to help you develop research skills in the design and execution of research projects; second, to help you develop research taste and broaden your exposure through discussion and interaction with my lab. My theory of research supervision is that research success is a combination of learned skills for executing research, broad exposure to develop research taste, and luck in having great ideas. You can't control when you have a great idea, but you can be ready to recognize it when it arrives and execute on it well. My goal is to help you develop the first two so that you're prepared for when your big idea finds you.

Bio

I am an Associate Professor of EECS at MIT. I run the Algorithmic Alignment Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL). My research develops methods to ensure that AI systems' behavior aligns with the goals and values of their human users and society as a whole, a concept known as 'AI alignment'. My group and I work to address alignment challenges in multi-agent systems, human-AI teams, and societal oversight of machine learning. Our goal is to enable the safe, beneficial, and trustworthy deployment of AI in real-world settings.
