source: arxiv machine learning: ucci: calibrated uncertainty for cost-optimal llm cascade routing

level: research

large language model cascades send easy queries to a small model and hard ones to a large model to save inference cost. most routers rely on uncalibrated confidence scores and need manual threshold tuning for each workload. this paper introduces ucci, a router that first calibrates token-level margin uncertainty into a per-query error probability using isotonic regression. it then picks an escalation threshold by minimizing cost under a quality constraint.
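
the two steps above can be sketched in plain python. this is an illustrative sketch, not the paper's implementation: `fit_isotonic`, `calibrated_error`, and `pick_threshold` are hypothetical names, the toy data is made up, and the cost model (every query pays for the small model, escalated queries additionally pay for the large one, and the large model has a fixed error rate) is an assumption.

```python
from bisect import bisect_left


def fit_isotonic(scores, errors):
    """Pool-adjacent-violators fit of a non-decreasing step function
    mapping an uncertainty score to an error probability."""
    # aggregate tied scores so each distinct score starts as one block
    agg = {}
    for s, e in zip(scores, errors):
        total, count = agg.get(s, (0.0, 0))
        agg[s] = (total + e, count + 1)

    blocks = []  # each block: [sum of errors, count, right edge score]
    for s in sorted(agg):
        total, count = agg[s]
        blocks.append([total, count, s])
        # merge backwards while the block means are not strictly increasing
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]
        ):
            total, count, right = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = right

    knots = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return knots, values


def calibrated_error(knots, values, score):
    """Look up the fitted step function; scores past the last knot
    get the last block's value."""
    i = bisect_left(knots, score)
    return values[min(i, len(values) - 1)]


def pick_threshold(probs, large_error, cost_small, cost_large, max_error):
    """Escalate a query when its calibrated error probability exceeds t.
    Return (t, cost per query, expected error) for the cheapest feasible t,
    or None if no threshold meets the quality constraint."""
    n = len(probs)
    best = None
    candidates = [-1.0] + sorted(set(probs))  # t = -1.0 escalates everything
    for t in candidates:
        kept = [p for p in probs if p <= t]
        n_escalated = n - len(kept)
        expected_error = (sum(kept) + n_escalated * large_error) / n
        cost = cost_small + cost_large * n_escalated / n
        if expected_error <= max_error and (best is None or cost < best[1]):
            best = (t, cost, expected_error)
    return best


# toy validation set: raw uncertainty score and whether the small model erred
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
errors = [0, 0, 1, 0, 0, 1, 1, 1]

knots, values = fit_isotonic(scores, errors)
probs = [calibrated_error(knots, values, s) for s in scores]
best = pick_threshold(probs, large_error=0.05, cost_small=1.0,
                      cost_large=3.0, max_error=0.15)
```

because the calibrated score is monotone in the raw uncertainty, a single threshold on it splits queries into "keep" and "escalate", and the search only needs to scan the distinct calibrated values.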

the paper states three explicit assumptions under which threshold policies on the calibrated score are cost-optimal. with isotonic calibration, expected calibration error shrinks at a rate of O(n^{-1/3}) in the number of calibration samples n. on a production named entity recognition task with 75,000 queries served by 4b and 12b instruction-tuned llms on h100 gpus, ucci reduces inference cost by 31% (95% confidence interval: 27% to 35%) at a micro-f1 of 0.91. expected calibration error drops from 0.12 to 0.03.
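
for readers unfamiliar with the metric, expected calibration error can be computed as below. this is a minimal sketch using equal-width bins; the bin count, the function name `expected_calibration_error`, and the toy inputs are assumptions, and the paper may use a different binning scheme.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted average, over equal-width probability bins, of the gap
    between the mean predicted probability and the observed frequency."""
    # per bin: [sum of predicted probabilities, sum of outcomes, count]
    bins = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[b][0] += p
        bins[b][1] += y
        bins[b][2] += 1
    n = len(probs)
    # (count / n) * |mean_p - mean_y| simplifies to |sum_p - sum_y| / n
    return sum(abs(sp - sy) / n for sp, sy, count in bins if count)


# toy example: predicted error probabilities vs. observed errors
ece = expected_calibration_error([0.15, 0.15, 0.85, 0.85], [0, 0, 1, 1])
```

in the routing setting, `probs` would be the router's predicted per-query error probabilities and `outcomes` would be 1 where the small model was actually wrong.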

ucci outperforms entropy-based routing at the same operating point. the approach removes the need for per-workload threshold tuning, making it easier to deploy in production. by focusing on calibration first, the router provides reliable uncertainty estimates that directly inform cost-saving decisions. this work shows that well-calibrated probabilities can lead to simpler, more efficient llm systems.
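
to make the comparison concrete, the sketch below computes one possible margin-based score and one possible entropy-based score from per-token probability distributions. the summary does not give the paper's exact definitions, so everything here is an assumption: margin is taken as the gap between the two most likely tokens, both signals are averaged over tokens, and the function names are hypothetical.

```python
import math


def token_margin(dist):
    """Gap between the two most probable tokens at one position."""
    top = sorted(dist, reverse=True)
    return top[0] - top[1]


def token_entropy(dist):
    """Shannon entropy (in nats) of one token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def query_scores(token_dists):
    """Return (margin uncertainty, mean entropy) for one query.
    Margin uncertainty is 1 - margin, averaged over tokens, so that
    higher values mean less confident for both signals."""
    n = len(token_dists)
    margin_uncertainty = sum(1.0 - token_margin(d) for d in token_dists) / n
    mean_entropy = sum(token_entropy(d) for d in token_dists) / n
    return margin_uncertainty, mean_entropy


# toy query with two generated tokens over a three-token vocabulary
margin_unc, entropy = query_scores([[0.9, 0.05, 0.05], [0.5, 0.5, 0.0]])
```

either raw score could be thresholded directly; the point of the calibration step is that the score is first mapped to an error probability, so the threshold has a consistent meaning across workloads.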

why it matters: better-calibrated uncertainty enables automatic, cost-optimal routing in llm cascades, cutting inference cost while still meeting a stated quality constraint.

