The Multi-Agent Off-Switch Game

Soroush Ebadian was mentored by Lewis Hammond.

Summary

The off-switch game framework has been instrumental in understanding corrigibility, the property that AI agents should allow human oversight and intervention. In single-agent settings, uncertainty about human preferences naturally incentivizes agents to defer to human judgment. However, as AI systems increasingly operate in multi-agent environments, a crucial question arises: does corrigibility compose across multiple agents? We introduce the multi-agent off-switch game and demonstrate that individually corrigible agents can become collectively incorrigible when strategic interactions are considered. Through formal analysis and illustrative examples, we show that corrigibility is not compositional and identify conditions under which group incorrigibility emerges. Our results highlight fundamental challenges for AI safety in multi-agent settings and suggest the need for new approaches that explicitly address collective dynamics.
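
The single-agent incentive mentioned above can be illustrated with a small numerical sketch. This is not the model from this project: it assumes the standard single-agent off-switch setup, in which the agent holds a belief over the utility U of its proposed action, switching off yields 0, and a rational human allows the action only when U is positive. The function name, the discrete belief, and the numbers are hypothetical and chosen only for illustration. The multi-agent case studied here is not modeled.

```python
from typing import Dict, Sequence


def expected_values(utilities: Sequence[float], probs: Sequence[float]) -> Dict[str, float]:
    """Expected payoff of each option under the agent's belief over U.

    Assumption (not from the source): the human is rational and knows U,
    so deferring yields max(U, 0) in every state of the world.
    """
    # Act directly, bypassing the human: the agent receives U.
    act = sum(p * u for u, p in zip(utilities, probs))
    # Switch itself off: payoff is 0 regardless of U.
    off = 0.0
    # Defer: the human permits the action only when U > 0.
    defer = sum(p * max(u, 0.0) for u, p in zip(utilities, probs))
    return {"act": act, "off": off, "defer": defer}


# Hypothetical belief: the action is harmful (U = -1) or helpful (U = 2),
# each with probability 0.5.
values = expected_values([-1.0, 2.0], [0.5, 0.5])
best = max(values, key=values.get)
print(values, best)
```

Under these assumptions, deferring is never worse than acting or switching off, because the expectation of max(U, 0) is at least as large as both E[U] and 0. The advantage is strict only when the agent is uncertain about the sign of U; if the belief puts all its mass on a positive U, acting and deferring are tied.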
