Peter Hase

Projects

Research Direction: Interpretability for monitoring and steering LLMs

  • CoT faithfulness methods. Can we make model CoT more monitorable by training models to faithfully report reward hacking behaviors in their CoT? See: https://arxiv.org/pdf/2602.20710

  • Goal introspection methods. Can we train models to report goals they have developed during earlier training steps (SL or RL)? See: https://arxiv.org/pdf/2510.05092

  • Interpretability for training. Can we use probes for truthfulness/honesty/faithfulness as reward models to steer model behavior? See: https://arxiv.org/pdf/2602.15515

  • Unsupervised elicitation. Can we elicit model “beliefs” without supervision by training models to make consistent claims about the world? See: https://arxiv.org/pdf/2506.10139v1
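To make the "interpretability for training" direction above concrete, here is a minimal sketch of a linear probe over hidden states used as a reward signal. All names and shapes here are hypothetical illustrations, not the actual method from the linked papers; the random tensor stands in for a real model's activations.

```python
import torch
import torch.nn as nn

class TruthfulnessProbe(nn.Module):
    """Linear probe mapping a hidden state to a truthfulness score in [0, 1]."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, hidden_dim), e.g. a final-token residual stream
        return torch.sigmoid(self.linear(hidden)).squeeze(-1)

def probe_reward(probe: TruthfulnessProbe, hidden: torch.Tensor) -> torch.Tensor:
    """Treat the frozen probe's score as a scalar reward for an RL update."""
    with torch.no_grad():
        return probe(hidden)

# Toy usage: random activations standing in for a real model's hidden states.
probe = TruthfulnessProbe(hidden_dim=16)
fake_hidden = torch.randn(4, 16)
rewards = probe_reward(probe, fake_hidden)
print(rewards.shape)  # one scalar reward per sequence in the batch
```

In an actual training loop, these rewards would replace or supplement a learned reward model's scores when steering model behavior.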

What I'm Looking for in a Mentee

  • You have the following technical background: familiarity with Python and PyTorch; you have trained a deep learning model, sampled from a language model locally (not just via an API), and done some scientific writing (e.g., a technical report or paper)

  • You do not need prior AI safety or interpretability research experience

  • You want to produce a research artifact (blog, paper, codebase) you can share with others

What I'm Like as a Mentor

  • In a normal week, I like to meet with my mentees once for a full hour to discuss research results, work through blockers, and plan next steps. Outside of meetings, I try to be readily available over Slack/email to answer questions as they arise.

  • When spinning up a project or preparing a paper submission, I try to be available daily for feedback and input.

  • I try to encourage my mentees to be creative and independent. I want them to feel ownership of the research they do.

Bio

Peter is a Postdoc at Stanford University and an AI Institute Fellow at Schmidt Sciences. His research focuses on LLM safety and interpretability. Previously, he has worked at Anthropic, Google, Meta, and the Allen Institute for AI.
