ai models fail classic attention test on longer tasks

source: sciencedaily ai: a classic brain test exposed ai's biggest weakness

level: research

researchers gave several leading ai models a classic psychology experiment called the stroop task. the test asks participants to name the ink color of words like "red" or "blue" while ignoring the word itself. when the word and the ink color conflict, answering correctly requires suppressing the automatic habit of reading. the ai systems performed well on short lists of five color words, but their accuracy dropped sharply as the lists grew longer.

gpt-4o scored 91% on five-word lists but fell to 57% at ten words and just 15% at forty words. claude 3.5 sonnet stayed stable up to twenty words, then declined to 24% at forty. similar patterns appeared in gpt-5, claude opus 4.1, and gemini 2.5. when matching and mismatched words were mixed together, performance worsened further, with accuracy on mismatched items nearing zero. the models defaulted to reading the words instead of naming the ink colors, failing to maintain the instruction across the list.
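
to make the setup concrete, here is a minimal python sketch of how a stroop list could be generated and scored, with accuracy reported separately for matching (congruent) and mismatched (incongruent) items. it is not the researchers' protocol: the color palette, the function names (`make_stroop_list`, `score`, `word_reader`), and the `congruent_ratio` parameter are all assumptions made for illustration. the `word_reader` baseline imitates the failure described above, where the responder reads the word instead of naming the ink.

```python
import random

# assumed palette for illustration; the study's actual colors are not given here
COLORS = ["red", "blue", "green", "yellow"]


def make_stroop_list(n, congruent_ratio=0.5, seed=0):
    """build n (word, ink) trials; a trial is congruent when word == ink."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n):
        ink = rng.choice(COLORS)
        if rng.random() < congruent_ratio:
            word = ink  # matching item
        else:
            # mismatched item: pick any color name other than the ink
            word = rng.choice([c for c in COLORS if c != ink])
        trials.append((word, ink))
    return trials


def score(trials, answers):
    """return accuracy overall and per trial type (None if a type is absent)."""
    buckets = {"overall": [], "congruent": [], "incongruent": []}
    for (word, ink), answer in zip(trials, answers):
        correct = answer.strip().lower() == ink  # the ink color is the right answer
        buckets["overall"].append(correct)
        buckets["congruent" if word == ink else "incongruent"].append(correct)
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}


def word_reader(trials):
    """hypothetical baseline that reads the word and ignores the ink."""
    return [word for word, _ in trials]
```

scoring `word_reader` on a mixed list gives perfect accuracy on matching items and zero on mismatched ones, which is why splitting the score by trial type helps separate "following the instruction" from "reading the word". a real evaluation would replace `word_reader` with model responses and repeat the run at several list lengths (for example five, ten, twenty, and forty words).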

humans typically sustain high accuracy on long stroop lists despite a natural bias toward reading. the study highlights a gap between human and machine attention. while ai can mimic human language, its attention mechanisms differ fundamentally. the performance collapse points to limits in current large language models when tasks demand resisting distractions over extended sequences.

why it matters: the result suggests that ai systems may be unreliable for tasks requiring sustained focus, such as monitoring long data streams or following complex instructions over time.
