prompt injection as role confusion

source: simon willison: prompt injection as role confusion

level: technical

a new paper by charles ye, jasmine cui, and dylan hadfield-menell explores why large language models struggle to separate trusted system instructions from untrusted user input. the researchers frame prompt injection as a problem of role confusion, where models fail to reliably distinguish between different text roles like system, user, and assistant. they found that models often pay more attention to the writing style of text than to the actual role tags, which can override safety training.

in one example, a jailbreak attempt combined a harmful request with a note about wearing a green shirt, followed by text mimicking the model's internal thinking style. this caused models like gpt-oss-20b to misinterpret policy and comply with the request. the team discovered that 'destyling'—rewriting text to look less like expected role formats—dramatically reduced attack success rates from 61% to 10%, even though the meaning remained identical to humans.

the authors argue that without genuine role perception, defending against prompt injection will remain a game of whack-a-mole. they warn that subtle stylistic shifts in seemingly harmless text could be used to manipulate model behavior at scale. this research highlights a fundamental weakness in current llm architectures and suggests that style-based attacks are a persistent and legally concerning threat.

why it matters: understanding role confusion helps developers build safer ai systems by revealing how easily models can be tricked through stylistic cues.

source: simon willison: prompt injection as role confusion