Noah Y. Siegel

3 Nov

Project

Research Direction: Understanding Explanatory Faithfulness

In order to oversee advanced AI systems, it’s important to understand their underlying decision-making process. When prompted, large language models (LLMs) can provide natural language explanations or reasoning traces that sound plausible and receive high ratings from humans. However, it is unclear to what extent these explanations are faithful, i.e., truly capture the factors responsible for the model’s predictions.

The field has discovered many instances where CoTs can be unfaithful, i.e. where models fail to mention factors which influence their decisions (https://arxiv.org/abs/2305.04388, https://arxiv.org/abs/2503.08679, https://arxiv.org/abs/2505.05410). But despite these cases, more recent work finds that CoTs seem to provide valuable monitorability, even when models are instructed to hide (https://arxiv.org/abs/2507.05246). I'm interested in expanding on this direction: how can we better measure the reliability of CoT monitoring?

In my research, I’ve attempted to quantify the degree to which LLM explanations exhibit faithfulness via counterfactual interventions, in particular random word insertions (https://arxiv.org/abs/2404.03189), and the way in which model scale and instruction-tuning impacts this faithfulness (https://arxiv.org/abs/2503.13445). Other recent work in this line has attempted to expand beyond simple word insertions into LLM-generated interventions, surfacing some interesting trends (https://arxiv.org/abs/2504.14150).

Some potential projects I'm interested in:

1. Can we expand the range of counterfactual interventions beyond approaches based on word insertion, e.g. by using LLMs to generate interventions on higher-level concepts? Can we do this in a general, domain-invariant way?

2. Can we create model organisms with compex misaligned behaviors, of the kind which are likely to require reasoning to execute? How much optimization pressure breaks CoT monitoring?

3. Can we create faithfulness benchmarks involving tasks where models are most likely to have incentives to hide their biases?

4. Can we use mech interp methods to verify whether a model's natural language reasoning corresponds to its internal process?

What I'm looking for in a Mentee

I'm looking for mentees with research experience, who can work effectively as part of a small team (2-3), and who are excited about improving AGI safety!

Bio

Noah is a research engineer at Google DeepMind working with the AGI Safety and Alignment Team in London. He’s worked on measuring faithfulness in LLM self-explanations (https://arxiv.org/abs/2404.03189, https://arxiv.org/abs/2503.13445), scalable oversight via debate (https://arxiv.org/abs/2407.04622), and process-based vs. outcomes-based feedback for answering math word problems (https://arxiv.org/abs/2211.14275).

Noah previously worked on RL for robotic control at DeepMind, and computer vision for analyzing scientific papers at AI2 in Seattle.

AI ControlCoT Faithfulness

Tobias Häberli

Noah Y. Siegel

Project

What I'm looking for in a Mentee

Bio

Elliott Thornley

Jesse Hoogland