Stefan Heimersheim

Projects

Research Direction: Mechanistic Interpretability

I'm excited to mentor both fundamental and applied mechanistic interpretability projects. Fundamental projects could include:

  • Why are there "stable regions" in activation space, and can this tell us about features and representations in neural networks?

  • Can we create useful toy models for mechanistic interpretability? In particular, I'm currently thinking of toy models of modularity.

In general, I want such projects to be aimed at solving a problem in interpreting LLMs (plus reward models, etc.), because this is what matters in practice (in the short term). I do work with non-LLM toy models, but only when I expect the insights to be useful on the path to interpreting state-of-the-art models. I am less interested in Sparse Autoencoder projects (e.g. applying SAEs to <new model type>) unless there's a reason to expect this to be useful.

I'm also excited to work out new project ideas together with promising candidates.

You can find a list of project ideas here.

What I'm looking for in a Mentee

I typically run projects with a team of 1-4 mentees. I hold a regular weekly 1h call with the team to answer questions, discuss progress, and give advice on next steps. We also communicate via Slack, and I can typically respond to questions within 24h (faster for urgent or blocking issues). I encourage mentees to share research results, plots, etc. throughout the week.

I expect mentees to lead the research project, running the experiments and writing up the results. I encourage mentees to take agency, think about next steps or new directions, and critically challenge my project plans; I'm always happy to give feedback on this. I expect mentees to prepare for our weekly meeting and to coordinate within their team. Usually this preparation takes the form of a simple slide deck (see here for what I'm aiming for) with a recap of last week's goals, results from the current week, questions and surprises, and ideas for next week. I expect mentees to meet internally at least once a week to update each other and prepare for the weekly call, and I recommend running coworking sessions throughout the week.

Bio

I'm an interpretability researcher at FAR.AI. Previously, I worked at Apollo Research, completed a PhD in Astronomy, and went through Neel Nanda's MATS program. My research interests lie mostly in fundamental mech interp, new methods, and good toy models. You can find some of my takes in my short-form comments. I try to focus on neglected ideas, which just means I avoid doing plain SAE projects. You can find my past projects on LessWrong as well as arXiv. I've mentored around 9 projects across various programs over the past year [1,2,3,4,5, more in preparation], including LASR, SPAR, MARS, and Pivotal.
