Joshua Engels & Bilal Chughtai

Projects

Research Direction: Interpretability

We’re primarily interested in more pragmatic interpretability ideas these days, although we have also listed some basic-science questions we would be interested in supervising projects on.

  • Applied Interp:

    • Build probes that generalize better. Possible methods: data pruning, synthetic data, or asking an LLM to apply distribution shifts to existing data (a minimal probe sketch follows this list)

    • Build a probe creation agent: Given a target behavior, the agent iterates on a probe to detect that behavior

    • Determine to what extent LLMs know when you mess with their CoT

  • Model Psychology:

    • Trawl arXiv, LessWrong, etc. for weird model behaviors. Automatically replicate these behaviors and possibly investigate some of them in more depth

    • Take a weird behavior and study it in depth, e.g. introspection, Neural Chameleons, roleplaying, uncertainty probes

    • To what extent do models have consistent values? Consistent preferences? How can we find these?

  • Model representations:

    • Study in-context learned representations
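
For readers new to probing, here is a minimal sketch of the kind of linear probe several of the projects above build on: train a simple classifier on frozen model activations to detect a target behavior, then ask how well it holds up under distribution shift. The activations and labels below are random placeholders, not outputs from any particular model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical setup: activations[i] is a model's residual-stream activation
# vector for prompt i, and labels[i] marks whether the target behavior is
# present. Random data stands in for real extracted activations here.
rng = np.random.default_rng(0)
n_examples, d_model = 1000, 512
activations = rng.normal(size=(n_examples, d_model))
labels = rng.integers(0, 2, size=n_examples)

# A linear probe: logistic regression trained on the frozen activations.
X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# In-distribution accuracy; the research question is how well this number
# holds up on shifted test sets (new topics, paraphrases, adversarial prompts).
print(f"in-distribution accuracy: {probe.score(X_test, y_test):.2f}")
```

The project ideas above then mostly concern the data around this loop: pruning or augmenting the probe's training set, or having an agent iterate on it, so that accuracy on shifted test sets becomes less brittle.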

What I'm looking for in a Mentee

We like to iterate fast on different ideas. We think it's great if we spend each of the first few weeks trying a different idea and then choose one to dive into in more depth.

  • We haven't mentored for Pivotal before but think that groups of 1-3 mentees are probably best.

  • We like to meet with our mentees once a week, and we also try to be responsive to research updates on Slack.

  • Prior interpretability research experience isn't required, but we do think a strong research background (e.g. currently pursuing or having completed a PhD) or engineering background (software, quant finance) gives the highest probability of a successful project.

Bio

Joshua

I'm currently a research scientist on the Google DeepMind language model interpretability team. I'm also on leave from my PhD at MIT, where I worked in Max Tegmark's group studying language model representations, sparse autoencoders, and scalable oversight. In my free time I like running, hiking, and backpacking!

Bilal

I'm a Research Engineer on the language model interpretability team at Google DeepMind, working to understand frontier AI models' cognition. I also spend a lot of time thinking about how to improve the efficiency and productivity of safety researchers. More about me here: bilalchughtai.co.uk.
