source: arxiv machine learning: models take notes at prefill: kv cache can be editable and composable

level: research

prefix caching reuses prefill computation only across exactly shared prefixes, so changing one field invalidates every cache entry downstream of it. the obvious shortcut, overwriting the field's own key/value vectors and reusing the rest, leaves the model acting on the old value. causal analysis across four model families explains why: during prefill the model writes field-conditioned conclusions onto downstream notes, and the field's own key/value vectors drive under 1% of the decision.
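
the invalidation behaviour of exact-prefix caching can be illustrated with a minimal sketch. the token list and the names `shared_prefix_len`, `old_prompt`, and `new_prompt` are hypothetical, chosen only for illustration, and are not taken from the paper or from any serving library.

```python
def shared_prefix_len(cached_tokens, new_tokens):
    """Count the leading tokens that both sequences have in common."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


# hypothetical prompt: a short record followed by a question
old_prompt = ["status", "=", "active", ";", "plan", "=", "pro", ";",
              "should", "we", "renew", "?"]

# the same prompt with a single field value changed (index 2)
new_prompt = list(old_prompt)
new_prompt[2] = "suspended"

# exact-prefix caching can only reuse entries before the first mismatch
reusable = shared_prefix_len(old_prompt, new_prompt)
recomputed = len(new_prompt) - reusable

print(f"reusable cache entries: {reusable} of {len(new_prompt)}")
print(f"entries to recompute:   {recomputed}")
```

in this toy case one changed token early in the prompt forces recomputation of everything after it, which is the cost the paper's editing approach aims to avoid.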

this notebook view enables two capabilities. first, the cache is editable: a salient erratum amends the notes. with chain-of-thought, editing the field alone recovers the decision at 1.00 accuracy for 8b models using about 1% of the compute, while without chain-of-thought the erratum is ignored. second, the cache is composable: the notes are position-portable, so a precompiled skill can be rope-repositioned and spliced into any context. the result is reported as indistinguishable from full recompute, with logit cosine similarity of 0.90 to 0.999 across twelve models, at linear rather than quadratic cost.
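
the positional half of this portability rests on a standard property of rotary position embeddings: rotations compose, so a key cached at one position can be moved by applying one extra rotation instead of recomputing it. the sketch below checks that property in pure python. it assumes the interleaved pair layout and base 10000 (real implementations vary in layout), the vectors are made up, and it says nothing about whether the cached contents stay valid in a new context, which is the paper's empirical claim.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles (RoPE)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("rope needs an even dimension")
    out = []
    for i in range(0, d, 2):
        # pair i/2 uses frequency base^(-2*(i/2)/d) = base^(-i/d)
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# hypothetical un-rotated key and query vectors
raw_key = [0.3, -1.2, 0.7, 0.5]
raw_query = [1.0, 0.2, -0.4, 0.9]

cached_key = rope_rotate(raw_key, 5)        # key as stored at position 5
moved_key = rope_rotate(cached_key, 100)    # shift by 100 using only the cache
recomputed_key = rope_rotate(raw_key, 105)  # reference: rotate from scratch

max_error = max(abs(a - b) for a, b in zip(moved_key, recomputed_key))

# attention scores depend only on the query/key offset (4 in both cases)
score_before = dot(rope_rotate(raw_query, 9), cached_key)
score_after = dot(rope_rotate(raw_query, 109), moved_key)

print(f"max repositioning error: {max_error:.2e}")
print(f"score drift after shift: {abs(score_before - score_after):.2e}")
```

repositioning touches each cached key once, which is consistent with a cost that grows linearly in the length of the spliced segment.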

the work reinterprets the kv cache not as raw key/value pairs but as a record of memoized intermediate conclusions. this shift allows selective updates and modular reuse, reducing compute for long-context tasks. in the reported experiments, edits propagate reliably when chain-of-thought is used, and spliced skills stay close to full recomputation.
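
the fidelity figure quoted for spliced skills is a cosine similarity between logit vectors. a minimal sketch of that metric follows. the two logit vectors are invented for illustration and are not results from the paper, and how the paper aggregates the metric over positions or prompts is not specified here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# hypothetical next-token logits over a four-token vocabulary
full_recompute_logits = [2.0, -1.0, 0.5, 3.0]
spliced_logits = [1.9, -0.8, 0.6, 3.1]

similarity = cosine_similarity(full_recompute_logits, spliced_logits)
print(f"logit cosine: {similarity:.4f}")
```

a value near 1 means the spliced cache produces almost the same logit direction as full recomputation. cosine similarity ignores overall scale, so it is a directional agreement measure rather than a guarantee of identical output probabilities.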

why it matters: this approach reduces compute for updating or reusing long-context prompts, making large language model inference more efficient for dynamic or modular tasks.

