Stefan Heimersheim
Projects
Research Direction: Mechanistic Interpretability
I'm most excited to mentor projects aimed at advancing fundamental mechanistic interpretability, in directions like:
Why are there "stable regions" in activation space (activation plateaus), and what can this tell us about features and representations in neural networks?
Can we create useful toy models for mechanistic interpretability? I'm keen to identify "computation in superposition in the wild" (maybe in fact-lookup in LLMs) to help us understand whether and how computation in superposition occurs in relevant models.
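To make "toy model" a bit more concrete, here is a minimal sketch (illustrative only, not taken from any of my projects) in the spirit of Anthropic's Toy Models of Superposition: more sparse features than hidden dimensions, so the learned feature directions interfere. Note that this illustrates superposition of representations rather than computation in superposition, and all hyperparameters are arbitrary choices.

```python
# Illustrative toy model of (representation) superposition: n_features sparse
# features are squeezed through a smaller hidden dimension and reconstructed.
# After training, feature directions overlap, i.e. they are in superposition.
import torch

n_features, n_hidden, sparsity = 20, 5, 0.95  # arbitrary illustrative values
W = torch.nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    # Sparse inputs: each feature is active with probability (1 - sparsity).
    feats = torch.rand(256, n_features)
    mask = (torch.rand(256, n_features) > sparsity).float()
    x = feats * mask
    # Reconstruct through the bottleneck: ReLU(W^T W x + b).
    x_hat = torch.relu((W.T @ (W @ x.T)).T + b)
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Interference between features: off-diagonal entries of W^T W are nonzero,
# meaning features share directions in the 5-dimensional hidden space.
gram = W.detach().T @ W.detach()
off_diag = gram - torch.diag(torch.diag(gram))
print("feature direction norms:", W.detach().norm(dim=0).round(decimals=2))
print("mean |feature overlap|:", off_diag.abs().mean().item())
```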
With my research, I want to advance our understanding of how LLMs work to the point where it can accelerate progress on alignment and safety research. I believe we likely need to make more fundamental progress before mechanistic interpretability can contribute to AI alignment. (Sparse dictionary learning (SDL)-based methods are the most promising existing technique, but I try to explore uncorrelated bets in case the SDL agenda doesn't work out.)
My research typically focuses on LLMs, or on toy models when we have reason to believe they capture what happens in LLMs. I want to find the right way to understand LLM representations, perhaps by identifying how concepts ("features") are represented in models. I'm looking for empirical artifacts of such features (e.g. my activation plateau work), or for effects that are predicted from theory (e.g. error correction or spiky activations).
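For a rough sense of what probing for activation plateaus can look like empirically, here is a minimal sketch (not my actual setup; the model, layer, perturbation direction, and KL metric are arbitrary illustrative choices): perturb one residual-stream activation with increasing magnitude and watch how strongly the output distribution reacts. A "plateau" would show up as the KL divergence staying near zero for small-to-moderate perturbations before rising sharply.

```python
# Illustrative sketch: perturb a residual-stream activation in GPT-2 and
# measure how much the next-token distribution changes with perturbation size.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("The Eiffel Tower is located in the city of", return_tensors="pt")

LAYER = 6       # which transformer block's output to perturb (arbitrary choice)
POSITION = -1   # perturb the residual stream at the final token position

torch.manual_seed(0)
direction = torch.randn(model.config.n_embd)
direction /= direction.norm()   # fixed, unit-norm perturbation direction

def next_token_logprobs(scale: float) -> torch.Tensor:
    """Add `scale * direction` to one residual-stream activation and return
    the model's next-token log-probabilities."""
    def hook(module, hook_inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, POSITION, :] += scale * direction
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        logits = model(**inputs).logits[:, -1, :]
    handle.remove()
    return torch.log_softmax(logits, dim=-1)

baseline = next_token_logprobs(0.0)
for scale in [0.0, 1.0, 2.0, 4.0, 8.0, 16.0]:
    kl = torch.nn.functional.kl_div(
        next_token_logprobs(scale), baseline, log_target=True, reduction="sum"
    )
    print(f"perturbation scale {scale:5.1f} -> KL from baseline {kl.item():.4f}")
```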
I'm also excited to work out new project ideas together with promising candidates.
You can find some of my past projects here.
What I'm looking for in a Mentee
I aim to run projects with a team of 2 mentees. I will have a regular weekly 1h call with the team to answer questions, discuss progress, and give advice on next steps. We also communicate via Slack, where I can typically respond to questions within 24h (faster for urgent or blocking issues). I encourage mentees to share research results, plots, etc. throughout the week.
I expect mentees to lead the research project, running the experiments and writing up the results. I encourage mentees to take initiative, think about next steps or new directions, and critically challenge my project plans; I’m always happy to give feedback on this. I expect mentees to prepare our weekly meeting and to coordinate within their team. Usually this takes the form of a simple slide deck (see here for what I’m aiming for) with a recap of last week’s goals, results from the current week, questions and surprises, and ideas for next week. I expect mentees to meet internally at least once a week to update each other and prepare for the weekly call, and I recommend running coworking sessions throughout the week.
Bio
I’m an interpretability researcher at FAR.AI. Previously, I worked at Apollo Research, completed a PhD in Astronomy, and went through Neel Nanda’s MATS program. My research interests lie mostly in fundamental mech interp, new methods & good toy models. You can find some of my takes in my short-form comments. I try to focus on neglected ideas, which just means I avoid doing plain SAE projects. You can find my past projects on LessWrong as well as arXiv. I’ve mentored around 9 projects across various programs over the past year [1,2,3,4,5, more in preparation], including LASR, SPAR, MARS, and Pivotal.
