Activation Probes Are Reliable With a Handful of Positive Examples

Riya Tyagi, mentored by Stefan Heimersheim.

Summary

Efforts to monitor advanced AI for rare misalignments face a data challenge: abundant aligned examples but only a handful of misaligned ones. We test activation probes in this "few vs. thousands" regime on spam and honesty detection tasks. On our tasks, when only a few (1-10) positive samples are available, training with many negative examples is on average more positive-sample-efficient than balanced training. We also find that LLM upsampling can provide a performance boost equivalent to roughly doubling the number of real positive samples, though excessive upsampling hurts performance. Finally, we show a positive scaling trend: larger models are more positive-sample-efficient to probe. Our findings suggest we should leverage the large number of available negative samples to amplify the signal from rare but critical misalignment examples.
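The "few vs. thousands" comparison can be illustrated with a minimal sketch: a linear probe (logistic regression) trained on synthetic stand-ins for activations, once with a balanced set and once with the same few positives against many negatives. All data, dimensions, and thresholds here are illustrative assumptions, not the paper's actual setup or results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d = 64  # stand-in for activation dimensionality (illustrative)

# Assume positives are shifted along one "misalignment" direction.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

def sample(n, positive):
    base = rng.normal(size=(n, d))
    return base + (1.5 * direction if positive else 0.0)

n_pos = 5  # only a handful of positive (misaligned) examples
X_pos = sample(n_pos, positive=True)

# Balanced training: 5 positives vs. 5 negatives.
X_bal = np.vstack([X_pos, sample(n_pos, positive=False)])
y_bal = np.array([1] * n_pos + [0] * n_pos)

# Many-negatives training: the same 5 positives vs. 2000 negatives.
n_neg = 2000
X_imb = np.vstack([X_pos, sample(n_neg, positive=False)])
y_imb = np.array([1] * n_pos + [0] * n_neg)

# Held-out balanced test set.
X_test = np.vstack([sample(500, True), sample(500, False)])
y_test = np.array([1] * 500 + [0] * 500)

aucs = {}
for name, X, y in [("balanced", X_bal, y_bal), ("many-negatives", X_imb, y_imb)]:
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    aucs[name] = roc_auc_score(y_test, probe.decision_function(X_test))
    print(f"{name}: AUC = {aucs[name]:.3f}")
```

The probe's decision direction (rather than its threshold) is what matters for ranking, so AUC is a reasonable metric even under heavy class imbalance.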

Full paper