source: kdnuggets: a deep dive into calibration of language models: platt scaling, isotonic regression, temperature scaling

level: technical

language models often report about 90% confidence on answers that turn out wrong far more than 10% of the time. this miscalibration is common in factual qa, code generation, and reasoning tasks. post-hoc recalibration fits a simple function on held-out data to map raw scores to better-calibrated probabilities. three standard methods are temperature scaling, platt scaling, and isotonic regression. all three were designed for classifiers, so applying them to llms needs care.
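one common way to quantify the gap between stated confidence and actual accuracy is expected calibration error (ece). the following is a minimal sketch, assuming equal-width confidence bins; the function name and the toy data are hypothetical and only illustrate the "90% confident, 60% correct" situation described above.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """equal-width-bin ece: weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # a confidence of exactly 1.0 goes into the last bin
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# hypothetical toy data: the model claims 0.9 on ten answers, six are correct
confidences = [0.9] * 10
correct = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
ece = expected_calibration_error(confidences, correct)
print(f"ece = {ece:.3f}")
```

a perfectly calibrated model would score 0 here; the larger the value, the further stated confidence drifts from observed accuracy.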

temperature scaling divides the logits by a single scalar t before the softmax. it is cheap, and because dividing by a positive t never changes which logit is largest, it preserves the model's predictions and their ranking. for base models without rlhf, a single t often corrects systematic overconfidence. after rlhf, overconfidence becomes input-dependent, so a single global t is no longer enough. adaptive temperature scaling instead predicts per-token temperatures from hidden features and is reported to improve calibration by 10–50% without hurting task performance. platt scaling fits a logistic function to the scores. it works with small calibration sets and can combine several scores as input features. however, it may be too coarse for tasks that need local decisions, and it can worsen proper scoring rules for models that are already strong.
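the single-scalar temperature scaling step described above can be sketched as follows. this is an illustrative, dependency-free version, assuming access to held-out logits and true labels; the helper names, the search bounds, and the toy data are hypothetical, and the one-dimensional search stands in for whatever optimizer a real pipeline would use.

```python
import math


def softmax(logits, t=1.0):
    """softmax of logits / t, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp((z - m) / t) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, t):
    """mean negative log-likelihood of the true labels at temperature t."""
    return -sum(math.log(softmax(row, t)[y])
                for row, y in zip(logit_rows, labels)) / len(labels)


def fit_temperature(logit_rows, labels, lo=0.05, hi=20.0, iters=80):
    """golden-section search over log(t) for the t that minimizes held-out nll."""
    a, b = math.log(lo), math.log(hi)
    phi = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        c = b - phi * (b - a)
        d = a + phi * (b - a)
        if nll(logit_rows, labels, math.exp(c)) < nll(logit_rows, labels, math.exp(d)):
            b = d
        else:
            a = c
    return math.exp((a + b) / 2.0)


# hypothetical held-out set: every prediction has a large margin (about 98%
# confidence at t = 1), but only four of the six predictions are correct
val_logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [0.0, 4.0], [4.0, 0.0], [0.0, 4.0]]
val_labels = [0, 0, 0, 1, 0, 0]

t_hat = fit_temperature(val_logits, val_labels)
print(f"fitted t = {t_hat:.3f}")
print(f"nll at t=1: {nll(val_logits, val_labels, 1.0):.3f}, "
      f"nll at fitted t: {nll(val_logits, val_labels, t_hat):.3f}")
```

a fitted t above 1 softens the distribution, which is the usual correction for overconfidence. the argmax of each row is the same before and after scaling, so accuracy is untouched; only the reported probabilities move.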

isotonic regression learns a piecewise-constant, monotonic mapping from scores to probabilities. it is more flexible than platt scaling and often beats it on expected calibration error (ece) and brier score. the cost is a higher risk of overfitting when calibration data is scarce. for llm multiclass settings, normalization-aware extensions help. the literature leaves gaps: how platt scaling and isotonic regression behave on post-rlhf models is untested, and direct llm benchmarks comparing all three methods are rare. picking the right confidence signal (token-level, sequence-level, or verbalized) is essential before applying any method.
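to make the contrast between the two mappings concrete, here is a minimal sketch that fits both on the same toy calibration set. it assumes the raw scores are confidences in (0, 1) with no ties, fits platt scaling on the logit of each confidence by plain gradient descent, and fits isotonic regression with the pool-adjacent-violators algorithm. all function names, hyperparameters, and data are hypothetical.

```python
import bisect
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def log_loss(probs, labels):
    return -sum(math.log(p) if y else math.log(1.0 - p)
                for p, y in zip(probs, labels)) / len(labels)


def brier(probs, labels):
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def fit_platt(features, labels, lr=0.1, steps=5000):
    """fit sigmoid(a * x + b) by gradient descent on log loss.
    starts at a=1, b=0, i.e. at the uncalibrated model."""
    a, b = 1.0, 0.0
    n = len(features)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(a * x + b) - y
            grad_a += err * x
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def fit_isotonic(scores, labels):
    """pool adjacent violators: returns (upper score of each block, block mean)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # each block is [sum of labels, count, largest score in block]
    for i in order:
        blocks.append([float(labels[i]), 1, scores[i]])
        # merge backwards while the previous block's mean is not below the last one
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [b[2] for b in blocks], [b[0] / b[1] for b in blocks]


def predict_isotonic(model, score):
    thresholds, values = model
    idx = bisect.bisect_left(thresholds, score)
    return values[min(idx, len(values) - 1)]


# hypothetical calibration set: raw confidences and whether the answer was correct
raw = [0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95, 0.99]
labels = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

a, b = fit_platt([logit(p) for p in raw], labels)
platt_probs = [sigmoid(a * logit(p) + b) for p in raw]

iso_model = fit_isotonic(raw, labels)
iso_probs = [predict_isotonic(iso_model, p) for p in raw]

ll_raw, ll_platt = log_loss(raw, labels), log_loss(platt_probs, labels)
brier_raw, brier_platt, brier_iso = (brier(raw, labels), brier(platt_probs, labels),
                                     brier(iso_probs, labels))
print(f"log loss  raw={ll_raw:.3f} platt={ll_platt:.3f}")
print(f"brier     raw={brier_raw:.3f} platt={brier_platt:.3f} isotonic={brier_iso:.3f}")
print("isotonic steps:", list(zip(*iso_model)))
```

the numbers are measured on the same ten points used for fitting, so they flatter both methods. the isotonic fit also assigns probabilities of exactly 0 and 1 at the extremes, which is a small-scale picture of the overfitting risk noted above; a held-out split would be needed for a fair comparison.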

why it matters: reliable confidence scores let practitioners trust model outputs in high-stakes tasks like medical qa or code generation.

