Damiano Fornasiere & Gaël Gendron
Projects
Research Direction: Introspection & contextualization: causes, implications, & learning
Causes of introspection. We have preliminary empirical evidence that the mechanisms allowing language models to answer different kinds of introspective questions share a common root. We would like to validate our hypothesis both theoretically and empirically.
Introspection and learning. We have shown that language models can learn in context to answer introspective questions, even beyond steering vectors and when reasoning is involved. Might this extend to prospective learning (the ability to act on predictions of the future---and in particular to predict how other agents will react to one's actions)? What does this imply for, e.g., alignment and cooperation?
Source-aware training is known to improve LLMs' transparency and verifiability, and to help with steering. Similar techniques (e.g., inoculation prompting) are now used to mitigate specification gaming as well as emergent misalignment. We are studying how [re]contextualization helps with generalization and alignment more generally.
What I'm looking for in a Mentee
Style:
I appreciate clear communication and clean math / code / writing, though the road to get there can be messy.
I am happy to supervise people at different career stages / levels of independence / backgrounds, provided the propensity to be scientific, precise, and excited is there.
Essential knowledge:
Foundations of machine learning and deep learning;
Transformer architecture and large language models;
Empirical AI safety literature (e.g., evaluations, guardrails, interpretability, …).
Essential experience:
Python;
Designing and implementing machine learning workflows using PyTorch;
Implementing evaluation protocols for language models;
Supervised or RL fine-tuning of language models, at least with toy experiments and some publicly available datasets.
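For concreteness, here is a minimal sketch of the kind of toy evaluation protocol meant above: exact-match accuracy over a tiny QA set, with a stub callable standing in for an actual language model. All names and data here are illustrative assumptions, not part of any project codebase.

```python
# Minimal sketch of an evaluation protocol for a language model.
# The "model" is a stub callable; in practice you would swap in an
# actual LM client (e.g., a Hugging Face pipeline or an API call).

def stub_model(prompt: str) -> str:
    # Hypothetical stand-in: answers some questions, gets one wrong.
    answers = {
        "Capital of France?": "Paris",
        "2 + 2 = ?": "4",
        "Largest planet?": "Saturn",  # deliberately wrong
    }
    return answers.get(prompt, "")

def exact_match_accuracy(model, dataset) -> float:
    """Fraction of items where the model's answer matches the gold label."""
    correct = sum(
        model(q).strip().lower() == a.strip().lower() for q, a in dataset
    )
    return correct / len(dataset)

dataset = [
    ("Capital of France?", "Paris"),
    ("2 + 2 = ?", "4"),
    ("Largest planet?", "Jupiter"),
]

print(exact_match_accuracy(stub_model, dataset))  # 2 of 3 correct
```

A real protocol would add normalization choices, multiple metrics, and error analysis, but the structure (dataset, model callable, scoring function) stays the same.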
Bonus:
Prompt and context engineering, RAG;
Experience with libraries such as vLLM, TRL, Hugging Face transformers, accelerate and/or PEFT;
Familiarity with statistical hypothesis testing.
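As an illustration of the statistical-testing bonus point, a paired sign-flip permutation test comparing two models' per-item correctness on the same eval set fits in a few lines of standard-library Python. The correctness vectors below are made-up toy data.

```python
import random

# Toy per-item correctness (1 = correct) for two hypothetical models
# evaluated on the same 20 items.
a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]
b = [1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1]

def permutation_p_value(a, b, n_perm=10_000, seed=0):
    """One-sided p-value: chance of a score difference at least as large
    as observed if the A/B labels were exchangeable within each item."""
    rng = random.Random(seed)
    observed = sum(a) - sum(b)
    hits = 0
    for _ in range(n_perm):
        diff = 0
        for x, y in zip(a, b):
            # Randomly swap which model each item's scores belong to.
            if rng.random() < 0.5:
                diff += x - y
            else:
                diff += y - x
        if diff >= observed:
            hits += 1
    return hits / n_perm

p = permutation_p_value(a, b)
print(p)
```

Pairing by item matters: it conditions on item difficulty, which an unpaired comparison of the two accuracies would ignore.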
What I'm Like as a Mentor
Supervision is well-scoped at the beginning; towards the second half, mentees have more freedom to self-direct. I am excited to sanity-check results, including the math/code/plots.
At least one 1-hour meeting per week. Always available via Slack and Google Chat.
Bio
Damiano is a research scientist at LawZero, where he works on (i) introspection, (ii) evaluation- and training-awareness, (iii) the math behind the Scientist AI. MATS and SPAR mentor, MATS alumnus, PhD in mathematics.
Gaël is a research scientist at LawZero. His work focuses on improving trustworthiness and factuality in frontier models and the Scientist AI. PhD in deep learning and causality theory.
