Logan Riggs Smith & Thomas Dooms

Interpretability

16 Apr

Projects

Research Direction: Ambitious Mech Interp w/ Tensor-transformers on Toy Languages

Findings from many toy models don't generalize to LLMs. To address this, we can build an increasingly realistic toy language, starting with known primitives (induction heads, skip-trigrams), in order to resolve our fundamental confusions. There are many ways to contribute:

Interp-across-time: Are there dependencies between the structures learned during training (e.g. must the model learn X before learning Y)?
Building interp tools: What techniques (existing or novel) can be used to find these ground truth features?
Phenomenon Studies: Use the controlled setup to characterize specific computational phenomena (suppression, error correction, compositional reuse) with ground-truth verification.
Tensor Interp: Because we're using a tensor-transformer, there may be new techniques available to us (prior familiarity with tensor networks is not required for this direction).
Improve the data-generating process (DGP): Extend the DGP to include computational patterns beyond n-grams & induction, learned from existing datasets (e.g. nested structure such as brackets and quotes, long-range dependencies, context-sensitive transitions).
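
To make the idea of a DGP with ground-truth structure concrete, here is a minimal sketch in Python. It is not the project's actual DGP: the function name `generate_sequence`, the specific bigram rule, and the parameters `vocab_size` and `p_induction` are all assumptions chosen for illustration. Each token comes either from a fixed bigram rule or from an induction-style copy (repeat whatever followed the previous occurrence of the current token), and a label records which rule fired, so an interp tool could be checked against it.

```python
import random


def generate_sequence(length, vocab_size=8, p_induction=0.9, seed=0):
    """Sample one toy sequence plus a ground-truth label per token.

    Labels: "start" for the first token, "induction" when the token was
    copied from the continuation of an earlier occurrence of the previous
    token, and "bigram" otherwise. All rules here are hypothetical.
    """
    rng = random.Random(seed)

    def bigram_next(tok):
        # Hypothetical bigram rule: t is followed by t+1 or t+3 (mod vocab).
        return (tok + rng.choice([1, 3])) % vocab_size

    tokens = [rng.randrange(vocab_size)]
    labels = ["start"]
    last_follower = {}  # token -> the token that most recently followed it

    while len(tokens) < length:
        prev = tokens[-1]
        if prev in last_follower and rng.random() < p_induction:
            # Induction: repeat what followed `prev` last time it appeared.
            nxt, label = last_follower[prev], "induction"
        else:
            nxt, label = bigram_next(prev), "bigram"
        last_follower[prev] = nxt
        tokens.append(nxt)
        labels.append(label)

    return tokens, labels


if __name__ == "__main__":
    toks, labs = generate_sequence(20)
    for t, l in zip(toks, labs):
        print(t, l)
```

Because every token carries a label, a candidate technique can be scored on whether it separates "induction" positions from "bigram" positions. Richer patterns (brackets, long-range dependencies) would slot in as additional labelled rules.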

Full Project Link: https://docs.google.com/document/d/1x5Ahn7-XSeln5lKWw2C9aORnDASzLlRzjVPiTLi60jM/edit?usp=sharing

What we’re looking for in a Mentee

We're looking for someone who's smart and gets work done. Some independence to pursue your own experiments & hypotheses is expected. To clarify, we'll give quick feedback but will not fully scope out the entire project for you.

You do NOT need a background in tensor networks.

What we’re like as Mentors

We typically have ~1 hour weekly meetings, giving feedback on the experiments tried and ideas on what to try next. As needed, we also have 1-on-1 meetings to brainstorm ideas and get mentees unstuck. We will respond to any updates within a day or two.

Bio

Thomas works on the scientific discovery team at Goodfire. His research focuses on interpreting models using their weights by making the model’s computation (e.g. circuits/interactions) algebraically analyzable. By treating the composition of features as a first-class citizen, it becomes possible to decompose rich representations into interpretable mechanisms and geometries.

Logan's been working in mech interp since 2023, co-authoring the first SAE paper. His main focus is on resolving our underlying confusions about model internals. This comes from a belief that ambitious mech interp is a verifiable task ("does the algorithm extracted match the model?"), but we need to properly define "simplicity", "features", and "circuits" first.
