Peter Hase

Projects

Research Direction: Interpretability for monitoring and steering LLMs

  • CoT faithfulness methods. Can we make model CoT more monitorable by training models to faithfully report reward hacking behaviors in their CoT? See: https://arxiv.org/pdf/2602.20710

  • Goal introspection methods. Can we train models to report goals they have developed during earlier training steps (SL or RL)? See: https://arxiv.org/pdf/2510.05092

  • Interpretability for training. Can we use probes for truthfulness/honesty/faithfulness as reward models to steer model behavior? See: https://arxiv.org/pdf/2602.15515

  • Unsupervised elicitation. Can we elicit model “beliefs” without supervision by training models to make consistent claims about the world? See: https://arxiv.org/pdf/2506.10139v1
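To make the "interpretability for training" direction above concrete, here is a minimal sketch of a linear probe over hidden states used as a reward signal. All names and shapes here are hypothetical illustrations, not the actual method from the linked papers; the random tensor stands in for a real model's activations.

```python
import torch
import torch.nn as nn

class TruthfulnessProbe(nn.Module):
    """Linear probe mapping a hidden state to a truthfulness score in [0, 1]."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, hidden_dim), e.g. a final-token residual stream
        return torch.sigmoid(self.linear(hidden)).squeeze(-1)

def probe_reward(probe: TruthfulnessProbe, hidden: torch.Tensor) -> torch.Tensor:
    """Treat the frozen probe's score as a scalar reward for an RL update."""
    with torch.no_grad():
        return probe(hidden)

# Toy usage: random activations standing in for a real model's hidden states.
probe = TruthfulnessProbe(hidden_dim=16)
fake_hidden = torch.randn(4, 16)
rewards = probe_reward(probe, fake_hidden)
print(rewards.shape)  # one scalar reward per sequence in the batch
```

In an actual training loop, these rewards would replace or supplement a learned reward model's scores when steering model behavior.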

What I'm Looking for in a Mentee

  • You have the following technical background: familiarity with Python and PyTorch; you have trained a deep learning model, sampled from a language model locally (not just via an API), and done some scientific writing (e.g., a technical report or paper)

  • You do not need prior AI safety or interpretability research experience

  • You want to produce a research artifact (blog, paper, codebase) you can share with others

What I'm Like as a Mentor

  • In a normal week, I like to meet with my mentees once for a full hour to discuss research results, work through blockers, and plan next steps. Outside of meetings, I try to be readily available over Slack/email to answer questions as they arise.

  • When spinning up a project or preparing a paper submission, I try to be available daily for feedback and input.

  • I try to encourage my mentees to be creative and independent. I want them to feel ownership of the research they do.

Bio

Peter is a Postdoc at Stanford University and an AI Institute Fellow at Schmidt Sciences. His research focuses on LLM safety and interpretability. Previously, he has worked at Anthropic, Google, Meta, and the Allen Institute for AI.
