2,000 people tried to hack an ai assistant and failed

source: simon willison: what happened after 2,000 people tried to hack my ai assistant

level: technical

fernando irarrázaval ran a challenge on hackmyclaw.com where participants tried to leak secrets from his openclaw test instance by sending it email. after 6,000 attempts, costing $500 in token spend and triggering a google account suspension from too many inbound emails, nobody managed to leak the secret. the model used was opus 4.6 with a prompt that explicitly forbade revealing secrets, modifying files, executing commands, or exfiltrating data based on email content.

simon willison notes that this aligns with his own observations: frontier model labs have been training their systems to resist injection attacks, as mentioned in the gpt-5.6 system card. these efforts appear to make such attacks much harder to pull off. however, he cautions against deploying production systems where prompt injection could cause irreversible damage, since 6,000 failed attempts do not guarantee that a more sophisticated approach would not succeed.

the hacker news discussion on this challenge was full of skepticism and good-faith replies from fernando. the experiment demonstrates that current ai assistants can be hardened against common prompt injection tactics, but the security community remains cautious. the results suggest that while basic attacks are thwarted, the risk of novel or advanced exploits persists, and rigorous testing is essential before trusting ai agents with sensitive operations.

why it matters: it shows that ai assistants can resist common prompt injection attacks, but security is not absolute, so data scientists should still avoid exposing sensitive systems to untrusted inputs.

source: simon willison: what happened after 2,000 people tried to hack my ai assistant