reusable safety adapters for fine-tuned language models

source: arxiv artificial intelligence: safegene: reusable adapters for transferable safety alignment

level: research

open-weight large language models are often fine-tuned for specific tasks, but fine-tuning can weaken their safety alignment and make them more likely to comply with harmful prompts. even when the fine-tuning data is not intentionally malicious, the model's ability to refuse unsafe requests can degrade. this creates a recurring problem: every time a model is updated with new data, its safety must be recovered, which is costly and time-consuming.

safegene addresses this by treating safety as an independent, reusable component. instead of repairing each model individually, it extracts a safety adapter from the difference between an aligned model and its degraded version. this adapter is refined using data-aware layer selection to capture transferable safety features. it can then be applied to any fine-tuned model within the same architecture family using a few-shot layer-wise coefficient adjustment, restoring safety without altering the task-specific knowledge.
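the extract-select-apply flow described above can be sketched with plain weight arithmetic. this is a minimal illustration under stated assumptions, not the paper's implementation: models are toy dictionaries of numpy arrays, the function names are hypothetical, and ranking layers by the norm of their difference is only a stand-in for the data-aware layer selection, whose details the summary does not give.

```python
import numpy as np

def extract_safety_vectors(aligned, degraded):
    """Per-layer difference: aligned weights minus degraded weights."""
    return {name: aligned[name] - degraded[name] for name in aligned}

def select_layers(vectors, keep):
    """Keep the `keep` layers with the largest difference norm.
    ASSUMPTION: a norm ranking replaces the data-aware selection."""
    ranked = sorted(vectors, key=lambda n: np.linalg.norm(vectors[n]), reverse=True)
    return {n: vectors[n] for n in ranked[:keep]}

def apply_adapter(finetuned, adapter, coeffs):
    """Add coefficient-scaled safety vectors to the selected layers only.
    Layers outside the adapter are left untouched."""
    patched = dict(finetuned)
    for name, vec in adapter.items():
        patched[name] = finetuned[name] + coeffs.get(name, 1.0) * vec
    return patched

# Toy data (hypothetical): three 2x2 "layers", with drift in two of them.
rng = np.random.default_rng(0)
aligned = {f"layer{i}": rng.normal(size=(2, 2)) for i in range(3)}
drift = {"layer0": 0.5, "layer1": 0.1, "layer2": 0.0}
degraded = {n: w - drift[n] for n, w in aligned.items()}

adapter = select_layers(extract_safety_vectors(aligned, degraded), keep=2)

# A different downstream model from the same architecture family.
finetuned = {n: w + rng.normal(scale=0.01, size=w.shape) for n, w in degraded.items()}
coeffs = {"layer0": 1.0, "layer1": 0.5}  # layer-wise coefficients, set by hand here
patched = apply_adapter(finetuned, adapter, coeffs)
```

because the adapter only touches the selected layers, every other layer of the fine-tuned model keeps its weights exactly, which is the sense in which this sketch leaves task-specific updates alone.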

the approach decouples safety from task updates, allowing the same adapter to be reused across multiple downstream models. this reduces the need for repeated safety training and makes it easier to maintain safe behavior as models evolve. safegene's method is designed to be efficient, requiring only a small number of examples to adapt the safety vectors to new tasks. that makes it practical for real-world deployment, where models are frequently updated.
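one way to picture the few-shot adjustment is a small search over one coefficient per layer. the sketch below is an assumption-heavy illustration, not the paper's procedure: `fit_coefficients` and `toy_loss` are hypothetical names, the coordinate-wise grid search is a swapped-in technique, and the quadratic loss merely stands in for scoring a patched model on a handful of labelled prompts.

```python
def fit_coefficients(layer_names, loss_fn, grid=(0.0, 0.5, 1.0)):
    """Coordinate-wise grid search: tune one layer's coefficient at a time,
    holding the others fixed, and keep the value with the lowest loss."""
    coeffs = {name: 0.0 for name in layer_names}
    for name in layer_names:
        coeffs[name] = min(grid, key=lambda c: loss_fn({**coeffs, name: c}))
    return coeffs

# ASSUMPTION: in practice loss_fn would patch the model with these
# coefficients and score it on a few examples; here it is a toy
# quadratic with a known optimum so the search can be checked.
target = {"layer0": 1.0, "layer1": 0.5}

def toy_loss(coeffs):
    return sum((coeffs[n] - target[n]) ** 2 for n in target)

fitted = fit_coefficients(["layer0", "layer1"], toy_loss)
```

the search needs only one loss evaluation per layer per grid value, which fits the small number of examples the summary describes.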

why it matters: it provides a practical way to maintain ai safety in customized language models without expensive retraining, making safe deployment more feasible as models are continuously fine-tuned.
