Joshua Engels & Bilal Chughtai
Projects
Research Direction: Interpretability
We’re primarily interested in more pragmatic interpretability ideas these days, although we have also listed some basic-science questions we would be happy to supervise projects on.
Applied Interp:
Build probes that generalize better. Possible methods: data pruning, synthetic data, or asking an LLM to apply distribution shifts to existing data (a rough sketch of this setup appears after this list)
Build a probe creation agent: Given a target behavior, the agent iterates on a probe to detect that behavior
Determine to what extent LLMs know when you mess with their chain of thought (CoT)
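For concreteness, here is a minimal Python sketch of the probe-generalization setup mentioned above: train a linear probe on activations from one distribution and check how well it transfers to a shifted one. The random activation arrays, the hidden-state width, and the noise-based "shift" are placeholder assumptions for illustration only; in practice you would cache real model activations and construct the shift with one of the methods listed above.

# Minimal sketch of the probe-generalization idea. The arrays below are
# random placeholders standing in for activations cached from a real model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512  # hypothetical hidden-state width

# Placeholder "activations": in practice, run the model on labeled prompts
# and cache a chosen layer's hidden states.
X_train = rng.normal(size=(1000, d_model))
y_train = rng.integers(0, 2, size=1000)

# Crude stand-in for a distribution shift (e.g. paraphrased or
# LLM-rewritten prompts): same labels, perturbed features.
X_shifted = X_train + rng.normal(scale=0.5, size=X_train.shape)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("in-distribution accuracy:", probe.score(X_train, y_train))
print("shifted accuracy:        ", probe.score(X_shifted, y_train))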
Model Psychology:
Trawl arXiv / LessWrong / etc. for weird model behaviors, automatically replicate them, and possibly investigate the most interesting ones in more depth
Take a weird behavior and study it in depth, e.g. introspection, Neural Chameleons, roleplaying, uncertainty probes
To what extent do models have consistent values? Consistent preferences? How can we find these? (One concrete way to measure consistency is sketched after this list.)
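As one possible operationalization of the preference-consistency question, here is a rough Python sketch that elicits pairwise choices and checks for intransitive cycles. The ask_model helper is hypothetical and stubbed out so the script runs end to end; in a real project it would be replaced with an actual model query.

# Sketch of one way to measure "consistent preferences": elicit pairwise
# choices between options and look for intransitive cycles (a > b, b > c,
# but c > a). `ask_model` is a hypothetical stub standing in for a real
# model query.
from itertools import combinations

def ask_model(option_a, option_b):
    """Hypothetical model query: return whichever option the model prefers."""
    return option_a  # stub; replace with a real model call

def find_intransitive_triples(options):
    # Record the model's stated preference for every unordered pair.
    prefers = {}
    for a, b in combinations(options, 2):
        winner = ask_model(a, b)
        prefers[(a, b)] = winner
        prefers[(b, a)] = winner

    def beats(x, y):
        return prefers[(x, y)] == x

    # A triple is intransitive if the preferences among its three options
    # form a cycle in either orientation.
    cycles = []
    for a, b, c in combinations(options, 3):
        if (beats(a, b) and beats(b, c) and beats(c, a)) or \
           (beats(b, a) and beats(c, b) and beats(a, c)):
            cycles.append((a, b, c))
    return cycles

print(find_intransitive_triples(["option A", "option B", "option C"]))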
Model representations:
Study in-context learned representations
What I'm looking for in a Mentee
We like to iterate fast. We think it works well to spend each of the first few weeks trying a different idea and then choose one to dive into in more depth.
We haven't mentored for Pivotal before but think that groups of 1-3 mentees are probably best.
We like to meet with our mentees once a week, and also try to be responsive on Slack to research updates.
Prior interpretability research experience isn't required, but we do think a strong research background (e.g. currently pursuing or having completed a PhD) or engineering background (software, quant finance) will give the highest probability of a successful project.
Bio
Joshua
I'm currently a research scientist on the Google DeepMind language model interpretability team. I'm also on leave from my PhD at MIT, where I worked in Max Tegmark's group studying language model representations, sparse autoencoders, and scalable oversight. In my free time I like running, hiking, and backpacking!
Bilal
I'm a Research Engineer on the language model interpretability team at Google DeepMind, working on understanding the cognition of frontier AI models. I also spend a lot of time thinking about how to improve the efficiency and productivity of safety researchers. More about me here: bilalchughtai.co.uk.
