persona gates refusal in chat models

source: arxiv artificial intelligence: refusal lives downstream of persona in chat models

level: research

researchers studied how refusal works in instruction-tuned chat models such as qwen2.5-7b-instruct and llama-3.1-8b-instruct. they found that refusal is not an isolated behavior: whether it is expressed depends on the persona the model adopts. after extracting a persona direction and a refusal direction from each model's activation space, they showed that steering the persona toward compliance almost completely suppresses refusal. in llama, the refusal rate dropped from 97% to 2% when the compliant persona was activated.
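to make "extracting a direction" and "steering" concrete, here is a minimal sketch on synthetic activations. it assumes a difference-of-means extraction, which is a common choice but is not confirmed by the source as the paper's method. the names `mean_difference_direction`, `steer`, and `alpha` are hypothetical, and the random arrays stand in for real hidden states.

```python
import numpy as np

def mean_difference_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean of neg_acts to the mean of pos_acts.

    Assumption: difference-of-means extraction; the paper may use another method.
    """
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(hidden, direction, alpha):
    """Add alpha * direction to every hidden state (each row of `hidden`)."""
    return hidden + alpha * direction

rng = np.random.default_rng(0)
d_model = 16

# Synthetic stand-ins for activations; the "compliant" set is offset along axis 0.
planted = np.zeros(d_model)
planted[0] = 1.0
compliant_acts = rng.normal(size=(200, d_model)) + 3.0 * planted
default_acts = rng.normal(size=(200, d_model))

persona_dir = mean_difference_direction(compliant_acts, default_acts)

# Steering moves each hidden state by exactly alpha along the unit direction.
hidden = rng.normal(size=(5, d_model))
steered = steer(hidden, persona_dir, alpha=4.0)
shift = (steered - hidden) @ persona_dir
```

in a real model the same addition would be applied to the residual stream at chosen layers during the forward pass; the layer choice and the steering strength are left open here because the summary does not state them.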

the team then reintroduced the refusal direction at different layers. adding it back at late layers partially restored refusal, but doing so at early layers had no effect. this suggests that refusal is computed early but only expressed later, with the persona acting as a gate at the expression stage. projecting out the persona direction in a late-layer window brought refusal back to baseline levels, while projecting out a random direction did not.
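the projection step can be sketched the same way. this is an illustration under stated assumptions, not the authors' code: `project_out`, `ablate_in_window`, the 12-layer stack, and the window (8, 12) are all hypothetical, and the per-layer arrays are synthetic. in a real model an edit at one layer also changes every later layer, which this toy list of independent arrays does not capture.

```python
import numpy as np

def project_out(hidden, direction):
    """Remove the component of each row of `hidden` along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return hidden - np.outer(hidden @ unit, unit)

def ablate_in_window(layer_states, direction, start, stop):
    """Project out `direction` only for layers with start <= index < stop."""
    return [
        project_out(h, direction) if start <= i < stop else h
        for i, h in enumerate(layer_states)
    ]

rng = np.random.default_rng(0)
d_model, n_layers, n_tokens = 64, 12, 8

# Hypothetical persona direction (axis 0) and a random control direction.
persona_dir = np.zeros(d_model)
persona_dir[0] = 1.0
random_dir = rng.normal(size=d_model)

# Synthetic per-layer hidden states carrying a strong persona component.
layer_states = [
    rng.normal(size=(n_tokens, d_model)) + 10.0 * persona_dir
    for _ in range(n_layers)
]

late_window = (8, 12)  # assumed window, chosen only for illustration
ablated = ablate_in_window(layer_states, persona_dir, *late_window)
control = ablate_in_window(layer_states, random_dir, *late_window)

# Mean persona component per layer after each intervention.
persona_after = [float(np.mean(h @ persona_dir)) for h in ablated]
control_after = [float(np.mean(h @ persona_dir)) for h in control]
```

projecting out the persona direction zeroes its component inside the window and leaves earlier layers untouched, while the random-direction control leaves the persona component largely intact. that mirrors the logic of the control comparison described above.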

these findings show that refusal is not a simple on-off switch but is deeply tied to the model's persona. treating refusal as a single isolated direction misses this dependence. the work highlights how safety mechanisms in language models are intertwined with broader behavioral traits, which could affect how we design and test alignment techniques.

why it matters: if persona gates refusal, safety interventions that target the refusal direction alone can be bypassed by shifting the persona, so reliable safety measures need to account for the interaction between the two.
