localized multidirectional correction for refusal suppression in moe models

source: arxiv statistics ml: lomc: localized multidirectional correction for refusal suppression in routed foundation models

level: research

researchers propose localized multidirectional correction (lomc), a method to reduce refusal behavior in routed mixture-of-experts (moe) and hybrid-moe foundation models. existing approaches either use broad direction edits that can harm general capabilities or expert-only edits that lack enough capacity to handle varied refusal patterns. lomc addresses this by first selecting a small set of experts and layers as an edit support, then computing prototype correction directions from refusal examples, aggregating them into layer-wise corrections, and applying rank-one updates only within that support.
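the four steps above can be sketched on toy data. this is an illustration only, not the paper's implementation: the difference-of-means prototypes, the mean-then-normalize aggregation, and the projection-style rank-one update `W - r (r^T W)` are assumed stand-ins, and every function and variable name here is hypothetical.

```python
import math

def mean(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def unit(v):
    """Scale v to unit length; a zero vector has no direction."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def prototype_directions(refusal_groups, baseline):
    """One unit direction per refusal group (assumed: difference of means)."""
    base = mean(baseline)
    return [unit([a - b for a, b in zip(mean(g), base)]) for g in refusal_groups]

def aggregate(directions):
    """Collapse prototype directions into one layer-wise unit direction
    (assumed: average, then renormalize)."""
    return unit(mean(directions))

def rank_one_ablate(W, r):
    """Return W - r (r^T W), a rank-one update that removes the component
    of the layer output y = W x along the unit direction r."""
    rows, cols = len(W), len(W[0])
    rtW = [sum(r[i] * W[i][j] for i in range(rows)) for j in range(cols)]
    return [[W[i][j] - r[i] * rtW[j] for j in range(cols)] for i in range(rows)]

def apply_to_support(expert_weights, support, r):
    """Edit only experts whose id is in the support; others are returned as-is."""
    return {eid: rank_one_ablate(W, r) if eid in support else W
            for eid, W in expert_weights.items()}

# Toy activations: two groups of refusal examples and a harmless baseline.
refusal_groups = [
    [[2.0, 0.0, 0.0], [4.0, 0.0, 0.0]],
    [[1.0, 1.0, 0.0], [1.0, 3.0, 0.0]],
]
baseline = [[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]

# Two toy experts with identity output weights; only expert 1 is in the support.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
experts = {0: identity, 1: [row[:] for row in identity]}

r = aggregate(prototype_directions(refusal_groups, baseline))
edited = apply_to_support(experts, support={1}, r=r)
```

in this toy run, expert 0 keeps its original weights, while the edited expert 1 can no longer write anything along `r`, since `r^T W'` is zero up to rounding.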

the support-then-correction design uses the edit support as a structural gate, so corrections affect only computations routed through the chosen experts. because several prototype directions are aggregated within that fixed support, correction capacity grows without expanding the intervention footprint. in experiments on multiple moe models, lomc achieves higher refusal suppression rates than the baselines while maintaining performance on standard benchmarks. the method also works across different refusal triggers and model scales, suggesting that it captures shared refusal representations.
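the gating idea can be illustrated with a toy routed layer. this is a sketch under simplifying assumptions, not the paper's code: routing is top-1 (real moe routers typically use weighted top-k), the correction is a plain projection, and all names are hypothetical.

```python
def project_out(y, r):
    """Remove the component of y along the unit direction r."""
    dot = sum(a * b for a, b in zip(y, r))
    return [a - dot * b for a, b in zip(y, r)]

def moe_forward(x, router, experts, support=frozenset(), r=None):
    """Top-1 routed layer. The correction fires only when the token's
    chosen expert belongs to the edit support."""
    eid = router(x)
    y = experts[eid](x)
    if r is not None and eid in support:
        y = project_out(y, r)
    return eid, y

# Toy router: the sign of the first feature picks the expert.
router = lambda x: 0 if x[0] >= 0 else 1
# Toy experts: both simply copy their input.
experts = {0: lambda x: list(x), 1: lambda x: list(x)}

r = [0.0, 1.0]   # assumed correction direction (unit length)
support = {1}    # only expert 1 is in the edit support

outside = moe_forward([1.0, 2.0], router, experts, support, r)
inside = moe_forward([-1.0, 2.0], router, experts, support, r)
```

the token routed to expert 0 passes through unchanged, while the token routed to expert 1 loses its component along `r`. the support therefore bounds which computations an edit can reach, however many directions are folded into the correction.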

lomc needs only a small number of refusal examples to compute its correction directions, which makes it practical for post-training adjustment of refusal behavior. because the edits are localized, they avoid degrading unrelated tasks, a common side effect of global fine-tuning and representation engineering. the approach could help deploy models that must answer sensitive queries in controlled settings without retraining the entire system.

why it matters: lomc offers a targeted way to adjust refusals in moe models without hurting general performance, which is relevant to both ai safety research and controlled deployment.
