one mask to rule them all: hidden facts after editing

source: arxiv machine learning: one mask to rule them all: on hidden facts after editing and how to find them

level: research

knowledge editing methods like rome and memit change factual associations in transformer models by modifying mlp weights. while these edits are usually judged by output behavior, their internal workings are not well understood. we looked at whether different edits rely on a common mechanism, even though each edit changes different weights. we found that rome and memit target the same small set of weights that are critical for keeping edits in place.
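
to make the weight-modification step concrete, here is a minimal sketch of a rank-one edit to a single linear map, which is the general shape of update rome applies to an mlp projection. it is a simplification under stated assumptions: the key-covariance weighting that rome uses is omitted, and `rank_one_edit`, `k_star`, `v_star`, and all numbers are illustrative choices, not taken from the paper.

```python
# Toy rank-one edit: change W so that the key k_star maps to the value v_star.
# Assumption: plain least-change update without ROME's covariance weighting.

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rank_one_edit(W, k_star, v_star):
    """Return W + (v_star - W k_star) k_star^T / (k_star . k_star)."""
    residual = [v - y for v, y in zip(v_star, matvec(W, k_star))]
    norm_sq = sum(k * k for k in k_star)
    return [
        [w + r * k / norm_sq for w, k in zip(row, k_star)]
        for row, r in zip(W, residual)
    ]

W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
k_star = [1.0, 2.0, 0.0]   # hypothetical key for the edited subject
v_star = [3.0, -1.0]       # hypothetical value encoding the new fact
k_orth = [2.0, -1.0, 0.0]  # a key orthogonal to k_star

W_edited = rank_one_edit(W, k_star, v_star)
print(matvec(W_edited, k_star))                    # edited key now maps to v_star
print(matvec(W, k_orth), matvec(W_edited, k_orth)) # orthogonal key is unaffected
```

in this toy, only the direction of `k_star` is rewritten, so each edit touches a different set of weight entries depending on its key. that is the setting in which the question of a shared mechanism across edits arises.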

to isolate this set, we trained a compact binary mask over the edited weights. applying the mask reversed 80% of edits on the training set and over 70% on the test set, which suggests that many different edits share a common functional structure. the mask works by removing over-attention in later layers of the model. when we injected the mask during editing, edit success dropped from 98% to 38%, indicating that this mechanism is largely necessary for edits to succeed.
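
the idea of learning one mask that undoes many edits can be sketched on a toy linear problem. this is not the paper's training procedure, which the summary does not specify. the sketch assumes a sigmoid relaxation of the mask, a softplus loss on the logit gap between the new and the original fact, a sparsity penalty that discourages removing weights, and a final threshold at 0.5. the per-weight contributions in `train_edits` and `held_out` are made-up numbers, constructed so that one coordinate carries every edit.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gap(contrib, mask):
    """Logit gap (new fact minus original) using weights kept where mask is 1."""
    return sum(c * m for c, m in zip(contrib, mask))

def train_mask(edits, steps=300, lr=0.5, sparsity=0.1, init_logit=2.0):
    """Learn keep-probabilities by gradient descent, then threshold to {0, 1}."""
    theta = [init_logit] * len(edits[0])
    for _ in range(steps):
        keep = [sigmoid(t) for t in theta]
        # Derivative of softplus(gap) with respect to gap is sigmoid(gap).
        pull = [sigmoid(gap(e, keep)) for e in edits]
        for j, m in enumerate(keep):
            # Loss = sum_i softplus(gap_i) + sparsity * sum_j (1 - keep_j)
            d_keep = sum(p * e[j] for p, e in zip(pull, edits)) - sparsity
            theta[j] -= lr * d_keep * m * (1.0 - m)  # chain rule through sigmoid
    return [1 if sigmoid(t) > 0.5 else 0 for t in theta]

def reversal_rate(edits, mask):
    """Fraction of edits whose gap turns negative (original fact wins again)."""
    return sum(gap(e, mask) < 0 for e in edits) / len(edits)

# Hypothetical per-weight contributions to the logit gap, one row per edit.
train_edits = [
    [4.0, -0.3, -0.2, -0.4, -0.1],
    [3.5, -0.2, -0.5, -0.1, -0.3],
    [5.0, -0.4, -0.1, -0.3, -0.2],
]
held_out = [[4.2, -0.3, -0.3, -0.2, -0.2]]

mask = train_mask(train_edits)
print("learned keep-mask:", mask)
print("train reversal rate:", reversal_rate(train_edits, mask))
print("held-out reversal rate:", reversal_rate(held_out, mask))
```

because the toy data share a single dominant coordinate, the learned mask removes only that coordinate and also reverses the held-out edit. real edited models are far less clean, which is consistent with the reported reversal rates being below 100%.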

these results suggest that knowledge editing is not just about changing specific facts. instead, it relies on a shared internal pathway that can be controlled with a simple mask. this finding could lead to better editing tools and ways to verify or undo edits. it also raises questions about how permanent and isolated these edits really are in large language models.

why it matters: understanding the shared mechanism behind knowledge edits helps build more reliable and controllable ai systems, and may prevent unintended side effects from model updates.
