risk-aware policy learning for offline bandits

source: arxiv statistics ml: pessimistic risk-aware policy learning in contextual bandits

level: research

offline policy learning from logged data is hard when you care about risk, not just average reward. most existing work focuses on expected outcomes or only evaluates risk, without optimizing for it. this paper tackles risk-aware optimization directly, using a broad class of risk functionals that are lipschitz-continuous. these include mean-variance, entropic risk, and conditional value-at-risk (cvar). the goal is to find a decision rule that minimizes risk, which is vital in high-stakes settings like healthcare or finance where bad outcomes must be avoided.

the authors build a distributional framework that estimates the full cumulative distribution function (cdf) of the reward under a policy, using importance sampling on the logged data. they then apply the risk functional to this estimated distribution. to make the optimization practical, they derive new concentration inequalities that bound the estimation error uniformly over policies. this leads to a data-dependent suboptimality bound that shrinks at a rate of roughly 1/sqrt(n) in the sample size n, without needing strong uniform assumptions.
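
the two-step recipe described here (estimate the reward cdf by importance weighting, then apply a risk functional to it) can be sketched in a few lines. this is a minimal illustration and not the authors' implementation: the function names `ips_cdf` and `cvar_from_cdf`, the clipping step, and the choice of lower-tail cvar on a discrete reward grid are all assumptions made for the example.

```python
import numpy as np

def ips_cdf(rewards, weights, grid):
    """Importance-weighted estimate of the reward CDF at each grid point.

    weights[i] is assumed to be pi(a_i | x_i) / mu(a_i | x_i), the ratio of
    target-policy to logging-policy probabilities for logged sample i.
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(weights, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # indicator[i, j] is True when rewards[i] <= grid[j]
    indicator = rewards[:, None] <= grid[None, :]
    cdf = (weights[:, None] * indicator).mean(axis=0)
    # the raw weighted estimate can exceed 1, so clip it (an assumed fix-up)
    return np.clip(cdf, 0.0, 1.0)

def cvar_from_cdf(grid, cdf, alpha):
    """Lower-tail CVaR: average reward over the worst alpha fraction.

    Treats the CDF as a discrete distribution on the sorted grid and assumes
    the estimated CDF reaches at least alpha.
    """
    grid = np.asarray(grid, dtype=float)
    cdf = np.asarray(cdf, dtype=float)
    mass = np.diff(np.concatenate(([0.0], cdf)))  # probability at each grid point
    before = cdf - mass                           # CDF just below each grid point
    # share of each point's mass that lies inside the lowest alpha of probability
    tail_mass = np.clip(np.minimum(cdf, alpha) - before, 0.0, None)
    return float((grid * tail_mass).sum() / alpha)

# toy logged data: four samples, hypothetical importance weights
grid = [0.0, 1.0, 2.0, 3.0]
cdf_hat = ips_cdf(rewards=[0.0, 1.0, 2.0, 3.0], weights=[2.0, 0.0, 2.0, 0.0], grid=grid)
print(cdf_hat, cvar_from_cdf(grid, cdf_hat, alpha=0.5))
```

other risk functionals mentioned in the summary, such as mean-variance or entropic risk, would replace `cvar_from_cdf` while reusing the same estimated cdf.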

the method is pessimistic: it adds a penalty based on the uncertainty in the distribution estimate, which encourages safer policies when data is limited. experiments on synthetic and real datasets show that the approach effectively controls risk, outperforming baselines that only optimize expected reward or use simpler risk proxies. the framework is flexible and can handle many risk measures, making it a general tool for offline risk-sensitive learning.
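
the pessimism principle can be illustrated with a small selection rule over a finite set of candidate policies: pick the policy whose estimated risk plus an uncertainty penalty is smallest. this is only a sketch under stated assumptions. the penalty used here, `beta * sqrt(mean(w^2) / n)` based on the second moment of the importance weights, is a common stand-in and not the paper's exact penalty; `pessimistic_select` and `beta` are hypothetical names.

```python
import numpy as np

def pessimistic_select(risk_estimates, weights_per_policy, beta=1.0):
    """Return the index of the policy minimizing estimated risk + penalty.

    risk_estimates[k]     : estimated risk of policy k (lower is better)
    weights_per_policy[k] : importance weights of the logged samples under policy k
    beta                  : assumed tuning constant scaling the penalty
    """
    scores = []
    for risk, w in zip(risk_estimates, weights_per_policy):
        w = np.asarray(w, dtype=float)
        # assumed uncertainty proxy: grows when a few samples carry large weights
        penalty = beta * np.sqrt(np.mean(w ** 2) / len(w))
        scores.append(float(risk + penalty))
    return int(np.argmin(scores)), scores

# policy 0 looks slightly better on paper but rests on a single heavily
# weighted sample; policy 1 is well covered by the logged data
best, scores = pessimistic_select(
    risk_estimates=[0.10, 0.12],
    weights_per_policy=[[4.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]],
)
print(best, scores)
```

with `beta=0` the rule reduces to plain risk minimization and would pick policy 0; the penalty is what steers the choice toward the better-supported policy when data is limited.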

why it matters: this work lets practitioners learn policies that explicitly control downside risk from historical data, crucial for deploying ai in safety-critical applications.
