source: kdnuggets: why do llms corrupt your documents when you delegate?
level: technical
a new study introduces delegate-52, a benchmark spanning 52 professional domains, to test how large language models handle document editing tasks. researchers used a round-trip method: they asked a model to make an edit, then to reverse it, and checked whether the original document was restored. a faithful editor should return the document unchanged, so any remaining difference counts as corruption. even top models like gemini pro, claude opus, and gpt-5 corrupted about 25% of content after 20 interactions. weaker models corrupted nearly half.
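the round-trip idea can be sketched in a few lines of python. this is a minimal illustration, not the benchmark's actual scoring code: the function name `round_trip_similarity`, the word-level `difflib` comparison, and the string-replacing stand-ins for a model are all assumptions made for the example.

```python
import difflib
from typing import Callable


def round_trip_similarity(
    original: str,
    apply_edit: Callable[[str], str],
    revert_edit: Callable[[str], str],
) -> float:
    """Apply an edit, revert it, and score how much of the original survives.

    Returns a word-level similarity in [0, 1]; 1.0 means a perfect round trip.
    The metric is an assumption for illustration, not the study's own measure.
    """
    edited = apply_edit(original)
    restored = revert_edit(edited)
    matcher = difflib.SequenceMatcher(None, original.split(), restored.split())
    return matcher.ratio()


# Hypothetical stand-ins for a model: plain string edits instead of LLM calls.
def rename_forward(text: str) -> str:
    return text.replace("invoice", "bill")


def rename_back_lossless(text: str) -> str:
    return text.replace("bill", "invoice")


def rename_back_lossy(text: str) -> str:
    # Reverts the rename but also silently changes an unrelated detail.
    return text.replace("bill", "invoice").replace("30 days", "60 days")


doc = "the invoice is due in 30 days"
lossless_score = round_trip_similarity(doc, rename_forward, rename_back_lossless)
lossy_score = round_trip_similarity(doc, rename_forward, rename_back_lossy)
print(lossless_score, lossy_score)
```

in a real harness the two callables would wrap model calls, and the score would be tracked across many consecutive edit and revert pairs rather than a single one.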
the study identifies several reasons for this decay. errors compound over multiple edits, like a game of telephone. weaker models tend to delete content, making corruption obvious as documents shrink. frontier models instead fabricate plausible but incorrect details while keeping word count and structure intact, making errors hard to spot. context overload from large documents or extra attachments worsens degradation, as models guess rather than adhere to source text. domain familiarity matters: models perform better on structured tasks like python code than on natural language or niche formats.
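two of these points lend themselves to a short python sketch. the first function contrasts the two failure modes: deletion shows up as a falling length ratio, while fabrication keeps length intact and only shows up when the words themselves are compared. the second backs out an average per-edit loss from the reported figure of about 25% after 20 interactions. the function names, the word-level `difflib` comparison, and the assumption that loss compounds at a constant rate per edit are illustrative choices, not details taken from the study.

```python
import difflib


def corruption_profile(original: str, edited: str) -> dict:
    """Compare length change with the fraction of original words preserved.

    A length ratio near 1.0 with a low preserved fraction suggests content
    was swapped rather than deleted. Illustrative metric, not the study's.
    """
    orig_words = original.split()
    new_words = edited.split()
    if not orig_words:
        raise ValueError("original document is empty")
    matcher = difflib.SequenceMatcher(None, orig_words, new_words, autojunk=False)
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    return {
        "length_ratio": len(new_words) / len(orig_words),
        "preserved_fraction": preserved / len(orig_words),
    }


def per_step_loss(total_loss: float, steps: int) -> float:
    """Average loss per edit, ASSUMING a constant rate that compounds.

    Solves (1 - loss) ** steps == 1 - total_loss for loss.
    """
    return 1.0 - (1.0 - total_loss) ** (1.0 / steps)


original = "revenue rose 12 percent in 2023 driven by cloud sales"
deleted = "revenue rose 12 percent"
fabricated = "revenue rose 18 percent in 2021 driven by retail sales"

deleted_profile = corruption_profile(original, deleted)
fabricated_profile = corruption_profile(original, fabricated)
step_loss = per_step_loss(0.25, 20)

print(deleted_profile)     # length shrinks along with preserved content
print(fabricated_profile)  # length unchanged, preserved content drops
print(f"{step_loss:.4f}")  # roughly 0.014, i.e. about 1.4% per edit
```

under the constant-rate assumption, a loss of only about 1.4% per edit is enough to reach 25% after 20 edits, which is why a single round trip can look clean while a long session does not.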
adding agentic tools, such as code execution or file access, does not fix the problem. the corruption stems from the transformer architecture itself. the study suggests that using llms as fully unsupervised document editors remains risky, and new verification methods are needed for long-horizon ai tasks.
why it matters: for ai and data science practitioners, the study exposes a critical reliability gap in llm-driven document workflows and makes the case for building safeguards against silent data corruption.