Factored Cognition Strengthens Monitoring and Thwarts Attacks

Aaron Sandoval was mentored by Cody Rushing and did research on AI control.

Summary

We develop AI control protocols that use factored cognition to reduce risks from scheming AIs. In our proposed family of factored cognition protocols, a task is decomposed into self-contained child tasks, each of which is solved in an isolated context by another control protocol. The child task solutions are then reassembled to solve the original task. We hypothesize four separate safety benefits of these protocols:

  1. Simplicity benefits for monitors: Monitoring is more effective on simpler child tasks, where both attacks and honest solutions are more recognizable.

  2. Simplicity benefits for other blue team tactics: Many other blue team tactics are easier when applied to simpler child tasks, such as editing solutions and generating unit tests.

  3. Attack coordination challenges: Isolated contexts hinder untrusted models’ ability to coordinate multi-step attacks that span multiple areas of the solution.

  4. Security through obscurity: For some threats, isolated contexts hide information about the schemer’s target, making it harder to assess attack opportunities and reducing attack success rate.
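
The decompose, solve-in-isolation, and reassemble flow described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in our experiments: the names `factored_protocol`, `ChildResult`, `decompose`, `solve_untrusted`, `monitor`, `reassemble`, and `audit_threshold` are all hypothetical, and the model calls are passed in as plain callables so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ChildResult:
    task: str          # child task text shown to the solver
    solution: str      # solution returned for that child task
    suspicion: float   # monitor score for this child solution
    audited: bool      # True if the score reached the audit threshold


def factored_protocol(
    parent_task: str,
    decompose: Callable[[str], List[str]],          # hypothetical: trusted decomposer
    solve_untrusted: Callable[[str], str],          # hypothetical: untrusted solver
    monitor: Callable[[str, str], float],           # hypothetical: trusted monitor
    reassemble: Callable[[str, List[str]], str],    # hypothetical: combines child solutions
    audit_threshold: float,
) -> Tuple[str, List[ChildResult]]:
    """Decompose a task, solve each child in isolation, monitor, and reassemble."""
    results: List[ChildResult] = []
    for child in decompose(parent_task):
        # Isolated context: the solver receives only the child task text,
        # never the parent task or sibling child tasks.
        solution = solve_untrusted(child)
        # The monitor also scores each child solution on its own.
        score = monitor(child, solution)
        results.append(ChildResult(child, solution, score, score >= audit_threshold))
    final = reassemble(parent_task, [r.solution for r in results])
    return final, results


# Toy usage with stub callables standing in for model calls (assumptions).
final, results = factored_protocol(
    "parse input and compute sum",
    decompose=lambda t: t.split(" and "),
    solve_untrusted=lambda c: f"solution({c})",
    monitor=lambda c, s: 0.9 if "sum" in c else 0.1,
    reassemble=lambda p, sols: "\n".join(sols),
    audit_threshold=0.5,
)
```

Passing the components as callables keeps the isolation property visible in the code: the solver's only input is the child task string, which is what hypotheses 3 and 4 rely on.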

Our experiments directly address hypotheses 1 and 4. We perform red-teaming in the APPS setting to assess the safety of a factored cognition protocol that uses the trusted model to decompose tasks and trusted monitoring protocols to solve child tasks. Because of the limitations of APPS and the limited ability of current LLMs to decompose tasks, we evaluate our protocol’s safety and its usefulness separately, each relative to trusted monitoring. Our main findings are:

  1. Adding factored cognition to a trusted monitoring protocol raises safety from 41% to 70%, due in part to improved monitor performance.

  2. Attacks on simpler child tasks are more obvious to monitors, but we find no evidence that honest solutions are more recognizable.

  3. Adding factored cognition lowers the attack success rate from 26% to 14%, but we do not yet know how much of this drop is due to security through obscurity.

  4. Factored cognition protocols are most useful for defending against the class of threats where harm can be caused only after a malicious final answer is accepted, as opposed to harms that occur as side effects while a solution is being computed.

  5. Factored cognition protocols have the greatest relative advantage on complex tasks that are decomposed by default, such as those already handled by LLM agents.

  6. The APPS setting is inadequate to properly benchmark factored cognition control protocols against simple baselines. A control setting with harder, more complex tasks is needed for a unified comparison of safety and usefulness. Testing these protocols in such a setting is the first priority for future research.
