Stefan Heimersheim

Projects

Research Direction: Mechanistic Interpretability

I'm excited to mentor both fundamental and applied mechanistic interpretability projects. Fundamental projects could include:

  • Why are there "stable regions" in activation space, and can this tell us about features and representations in neural networks?

  • Can we create useful toy models for mechanistic interpretability? In particular, I'm currently thinking of toy models of modularity.

In general, I want such projects to be aimed at solving a problem in interpreting LLMs (plus reward models, etc.), because this is what matters in practice (in the short term). I do work with non-LLM toy models, but only when I expect the insights to be useful on the path to interpreting state-of-the-art models. I am less interested in Sparse Autoencoder projects (e.g. applying SAEs to <new model type>) unless there's a reason to expect this to be useful.

I'm also excited to work out new project ideas together with promising candidates.

You can find a list of project ideas here.

What I'm looking for in a Mentee

I typically run projects with a team of 1-4 mentees. I hold a regular weekly 1h call with the team to answer questions, discuss progress, and give advice on next steps. We also communicate via Slack, and I can typically respond to questions within 24h (faster for urgent or blocking issues). I encourage mentees to share research results, plots, etc. throughout the week.

I expect mentees to lead the research project, running the experiments and writing up the results. I encourage mentees to take agency, think about next steps or new directions, and critically challenge my project plans; I'm always happy to give feedback on this. I expect mentees to prepare for our weekly meeting and to coordinate within their team. Usually this preparation takes the form of a simple slide deck (see here for what I'm aiming for) with a recap of last week's goals, results from the current week, questions and surprises, and ideas for next week. I expect mentees to meet internally at least once a week to update each other and prepare for the weekly call, and I recommend running coworking sessions throughout the week.

Bio

I'm an interpretability researcher at FAR.AI. Previously, I worked at Apollo Research, completed a PhD in Astronomy, and went through Neel Nanda's MATS program. My research interests lie mostly in fundamental mech interp, new methods, and good toy models. You can find some of my takes in my short-form comments. I try to focus on neglected ideas, which just means I avoid doing plain SAE projects. You can find my past projects on LessWrong as well as arXiv. I've mentored around 9 projects across various programs over the past year [1,2,3,4,5, more in preparation], including LASR, SPAR, MARS, and Pivotal.
