Joshua Engels & Bilal Chughtai
Projects
Research Direction: Interpretability
We’re primarily interested in more pragmatic interpretability ideas these days, although we have also listed some basic-science questions we would be happy to supervise projects on.
Applied Interp:
Build probes that generalize better. Possible methods: data pruning, synthetic data, or asking an LLM to apply distribution shifts to existing data (a rough sketch of this setup appears after this list)
Build a probe creation agent: Given a target behavior, the agent iterates on a probe to detect that behavior
Determine to what extent LLMs know when you mess with their chain of thought (CoT)
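For concreteness, here is a minimal Python sketch of the probe-generalization setup mentioned above: train a linear probe on activations from one distribution and check how well it transfers to a shifted one. The random activation arrays, the hidden-state width, and the noise-based "shift" are placeholder assumptions for illustration only; in practice you would cache real model activations and construct the shift with one of the methods listed above.

# Minimal sketch of the probe-generalization idea. The arrays below are
# random placeholders standing in for activations cached from a real model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512  # hypothetical hidden-state width

# Placeholder "activations": in practice, run the model on labeled prompts
# and cache a chosen layer's hidden states.
X_train = rng.normal(size=(1000, d_model))
y_train = rng.integers(0, 2, size=1000)

# Crude stand-in for a distribution shift (e.g. paraphrased or
# LLM-rewritten prompts): same labels, perturbed features.
X_shifted = X_train + rng.normal(scale=0.5, size=X_train.shape)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("in-distribution accuracy:", probe.score(X_train, y_train))
print("shifted accuracy:        ", probe.score(X_shifted, y_train))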
Model Psychology:
Trawl arXiv / LessWrong / etc. for weird model behaviors, automatically replicate them, and possibly investigate the most interesting ones in more depth
Take a weird behavior and study it in depth, e.g. introspection, Neural Chameleons, roleplaying, uncertainty probes
To what extent do models have consistent values? Consistent preferences? How can we find these? (One concrete way to measure consistency is sketched after this list.)
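As one possible operationalization of the preference-consistency question, here is a rough Python sketch that elicits pairwise choices and checks for intransitive cycles. The ask_model helper is hypothetical and stubbed out so the script runs end to end; in a real project it would be replaced with an actual model query.

# Sketch of one way to measure "consistent preferences": elicit pairwise
# choices between options and look for intransitive cycles (a > b, b > c,
# but c > a). `ask_model` is a hypothetical stub standing in for a real
# model query.
from itertools import combinations

def ask_model(option_a, option_b):
    """Hypothetical model query: return whichever option the model prefers."""
    return option_a  # stub; replace with a real model call

def find_intransitive_triples(options):
    # Record the model's stated preference for every unordered pair.
    prefers = {}
    for a, b in combinations(options, 2):
        winner = ask_model(a, b)
        prefers[(a, b)] = winner
        prefers[(b, a)] = winner

    def beats(x, y):
        return prefers[(x, y)] == x

    # A triple is intransitive if the preferences among its three options
    # form a cycle in either orientation.
    cycles = []
    for a, b, c in combinations(options, 3):
        if (beats(a, b) and beats(b, c) and beats(c, a)) or \
           (beats(b, a) and beats(c, b) and beats(a, c)):
            cycles.append((a, b, c))
    return cycles

print(find_intransitive_triples(["option A", "option B", "option C"]))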
Model representations:
Study in-context learned representations
What I'm looking for in a Mentee
We like to iterate fast. We think it works well to spend each of the first few weeks trying a different idea and then choose one to dive into in more depth.
We haven't mentored for Pivotal before but think that groups of 1-3 mentees are probably best.
We like to meet with our mentees once a week, and also try to be responsive on Slack to research updates.
Prior interpretability research experience isn't required, but we do think a strong research background (e.g. currently pursuing or having completed a PhD) or engineering background (software, quant finance) will give the highest probability of a successful project.
Bio
Joshua
I'm currently a research scientist on the Google DeepMind language model interpretability team. I'm also on leave from my PhD at MIT, where I worked in Max Tegmark's group studying language model representations, sparse autoencoders, and scalable oversight. In my free time I like running, hiking, and backpacking!
Bilal
I'm a Research Engineer on the language model interpretability team at Google DeepMind, working on understanding the cognition of frontier AI models. I also spend a lot of time thinking about how to improve the efficiency and productivity of safety researchers. More about me here: bilalchughtai.co.uk.
