level: research
large language model post-training is often framed as supervised fine-tuning for imitation and reinforcement learning for discovery. this dichotomy is too coarse. the key question is whether training makes the model more likely to produce behaviors it could already exhibit, or whether it enables genuinely new behaviors. the paper introduces accessible support: the set of behaviors a model can practically produce under a bounded compute or sampling budget. post-training that reweights probability mass inside this set is capability elicitation; post-training that changes the set itself is capability creation.
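one way to make this precise (a minimal sketch in my own notation, not necessarily the paper's): fix a prompt x, a sampling budget k, and a reliability threshold ε. a behavior y belongs to the accessible support of a base policy π₀ if k independent draws produce it with probability at least ε:

```latex
S_{k,\epsilon}(\pi_0 \mid x) \;=\;
  \bigl\{\, y \;:\; 1 - \bigl(1 - \pi_0(y \mid x)\bigr)^{k} \ge \epsilon \,\bigr\}
```

under this reading, elicitation moves probability mass around inside S_{k,ε}, while creation enlarges the set itself.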
the authors analyze post-training through a free-energy framework. both supervised fine-tuning and reinforcement learning can be seen as reweighting a pretrained reference distribution. when the target distribution stays within the model's existing accessible support, training merely elicits latent capabilities; when it lies outside, training must create new capabilities by expanding the support. this distinction matters for how we interpret post-training results and design training procedures: reweighting can only amplify mass the reference already assigns, so a behavior with practically zero reference probability cannot be reached by reweighting alone.
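the reweighting view rests on a standard identity (stated here in generic notation; the paper's exact formulation may differ): maximizing expected reward under a kl penalty toward the reference has a closed-form solution that is an exponential tilt of the reference,

```latex
\pi^{*} \;=\; \arg\max_{\pi}\;
  \mathbb{E}_{y \sim \pi(\cdot \mid x)}\bigl[r(x,y)\bigr]
  \;-\; \beta\,\mathrm{KL}\bigl(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\bigr),
\qquad
\pi^{*}(y \mid x) \;=\; \frac{\pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\bigl(r(x,y)/\beta\bigr)}{Z(x)}
```

because the solution multiplies the reference pointwise, it preserves the reference's support: wherever π_ref assigns zero (or negligible) probability, π* does too, which is exactly why reweighting alone elicits rather than creates.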
the paper suggests that many reported gains from reinforcement learning may be elicitation rather than creation, and proposes diagnostics to tell the two apart, for example checking whether a behavior already appears when the pretrained model is sampled under different strategies (higher temperatures, larger sample budgets). this could make post-training more efficient: use cheap elicitation where possible and reserve expensive creation steps for genuinely new capabilities. the framework also connects to mode collapse and to the cost of exploration in reinforcement learning. a sketch of such a diagnostic follows.
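a minimal sketch of one such diagnostic, assuming a task-specific verifier and a sampler for the pretrained model (both hypothetical stand-ins, not the paper's code): estimate pass@k for the base model on the task; a clearly nonzero value suggests the post-trained gain is at least partly elicitation, while a value near zero is consistent with creation (or with too small a sample budget).

```python
import math
from typing import Callable

def pass_at_k(n: int, c: int, k: int) -> float:
    """unbiased pass@k estimator (Chen et al., 2021): 1 - C(n-c, k) / C(n, k),
    computed in the numerically stable product form."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

def elicitation_score(
    sample_base: Callable[[], str],     # draws one completion from the *pretrained* model
    is_success: Callable[[str], bool],  # task-specific verifier (hypothetical stand-in)
    n: int = 200,                       # total samples drawn
    k: int = 100,                       # budget at which pass@k is estimated
) -> float:
    """estimate how reachable a target behavior already is for the base model."""
    outputs = [sample_base() for _ in range(n)]
    c = sum(is_success(o) for o in outputs)
    return pass_at_k(n, c, k)
```

varying the sampler (temperature, nucleus cutoff) and the budget k traces out how the base model's accessible support shifts with sampling strategy, which is the comparison the paper's diagnostic idea calls for.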
why it matters: understanding whether post-training elicits or creates capabilities helps allocate compute efficiently and interpret model improvements correctly.