Konstantinos Voudouris
Projects
Research Direction: Scalable Oversight & Evaluation
Running debate experiments with LLMs and humans. In particular, we want to see if we can get debate to work as a scalable oversight method, and surface any problems with the debate protocol, including obfuscated arguments and judge hacking.
Systematic study of judge hacking in debate. Can models learn to hack both weak LLM and human judges when trained via debate? What mitigations can we implement to lessen this risk?
Auditing games with human participants - can LLMs learn to exploit human pseudo-randomness in auditing?
The opportunities and challenges of automating human studies in the agentic era. It is likely that we will be handing off safety research to AI agents at some point, and this will necessarily involve them running human studies. What is the current state of the art here? What are the practical and ethical challenges that we should pre-empt?
Producing better measurement models using psychometrics, in particular Item Response Theory and Ising models.
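To give a concrete flavour of the measurement models involved, here is a minimal sketch of the two-parameter logistic (2PL) Item Response Theory model, which maps a respondent's latent ability and an item's difficulty and discrimination to a probability of a correct response. The function name and parameter values below are illustrative only, not taken from any specific project.

```python
import math

def irt_2pl(theta, a, b):
    """Probability that a respondent with latent ability `theta`
    answers an item correctly under the 2PL IRT model, where
    `a` is the item's discrimination and `b` its difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# When ability equals difficulty, the model predicts a 50% success rate;
# higher discrimination makes the item separate abilities more sharply.
print(irt_2pl(0.0, a=1.0, b=0.0))
print(irt_2pl(1.0, a=2.0, b=0.0))
```

In an evaluation setting, `theta` would be a model's estimated capability and each benchmark task an "item"; fitting such a model to response data yields difficulty and discrimination estimates per task rather than a single accuracy number.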
Practical recommendations for avoiding a replication crisis in the automated science era. As we hand off research to AI models, it seems probable that we will see another replication/generalisability crisis. We should pre-empt this and put the appropriate mitigations in place well in advance. One version of this project would also involve running game-theoretic simulations of science to stress-test proposed mitigations.
What I'm looking for in a Mentee
I like working with people whose skill sets complement my own and who are open to thinking broadly and across disciplines. I'm looking for collaborators with backgrounds in the social and cognitive sciences who also have some programming experience. PhD- or postdoc-level research experience is ideal.
What I'm Like as a Mentor
I like flat mentoring structures - I am a collaborator rather than a mentor. I like to help at the object-level as much as time allows, and I get into the weeds with the details. I will usually schedule one check-in meeting a week and then expect lots of ad hoc communication as and when questions arise.
Bio
Konstantinos is the cognitive scientist on the Alignment Team at the UK AI Security Institute. His research focuses on advancing the sciences of AI alignment, scalable oversight, and AI evaluation using tools from the cognitive sciences. Combining these fields helps build better, safer, and more human-like AI systems, and informs sensible AI policy.
