minimax pac bounds for learning in exogenous contextual mdps

source: arxiv statistics ml: minimax pac bounds for learning in exogenous contextual mdps

level: research

this paper studies reinforcement learning in tabular discounted markov decision processes with exogenous i.i.d. contexts. at each step, a context is drawn independently from an unknown distribution and revealed before the agent acts. the context can affect rewards and transitions but is not controlled by the agent. the learner may access sampling oracles for the context distribution, the transition kernel, or both, either before or during policy execution.
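to make the setting concrete, here is a minimal sketch of a toy environment of this kind. it is an illustration under stated assumptions, not code from the paper: the class name `ExoContextMDP`, the indexing order `[state][context][action]`, rewards in [0, 1], and the randomly generated model are all hypothetical choices. the two `sample_*` methods stand in for the sampling oracles described above.

```python
import random


class ExoContextMDP:
    """Toy tabular discounted MDP with i.i.d. exogenous contexts.

    All names and the random model below are illustrative assumptions.
    """

    def __init__(self, n_states, n_actions, n_contexts, gamma=0.9, seed=0):
        self.S, self.A, self.C = n_states, n_actions, n_contexts
        self.gamma = gamma
        self.rng = random.Random(seed)
        # Context distribution: hidden from the learner, reachable via the oracle.
        self.ctx_dist = self._random_dist(n_contexts)
        # P[s][c][a] is a probability vector over next states.
        self.P = [[[self._random_dist(n_states) for _ in range(n_actions)]
                   for _ in range(n_contexts)] for _ in range(n_states)]
        # R[s][c][a] is a reward in [0, 1).
        self.R = [[[self.rng.random() for _ in range(n_actions)]
                   for _ in range(n_contexts)] for _ in range(n_states)]

    def _random_dist(self, k):
        # Normalize positive random weights into a probability vector.
        w = [self.rng.random() + 1e-9 for _ in range(k)]
        total = sum(w)
        return [x / total for x in w]

    def sample_context(self):
        """Context oracle: one i.i.d. draw, independent of state and action."""
        return self.rng.choices(range(self.C), weights=self.ctx_dist)[0]

    def sample_transition(self, s, c, a):
        """Transition oracle: next state given a (state, context, action) tuple."""
        return self.rng.choices(range(self.S), weights=self.P[s][c][a])[0]

    def rollout(self, policy, s0, horizon):
        """Truncated discounted return of policy(s, c) starting from s0."""
        s, total = s0, 0.0
        for t in range(horizon):
            c = self.sample_context()  # context is revealed before the agent acts
            a = policy(s, c)
            total += (self.gamma ** t) * self.R[s][c][a]
            s = self.sample_transition(s, c, a)
        return total
```

for example, `ExoContextMDP(3, 2, 4).rollout(lambda s, c: 0, s0=0, horizon=50)` returns the truncated discounted return of the policy that always plays action 0. note that the agent's choice never enters `sample_context`, which is what makes the context exogenous.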

the sample complexity is measured by a pair (n, m), where n is the number of oracle calls before execution and m is the number during execution. the work provides minimax pac bounds for different oracle access regimes. when rewards and transitions are known and only the context distribution is unknown, the bounds show how many samples are needed to learn a near-optimal policy. when transitions are also unknown, the complexity increases, and the paper characterizes the trade-off between pre-execution and during-execution samples.
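for the regime with known rewards and transitions and an unknown context distribution, one natural baseline is a plug-in approach: estimate the context distribution from n pre-execution samples, then plan against the estimate. the sketch below assumes this baseline and is not claimed to be the paper's algorithm. the function names and the `[state][context][action]` indexing are hypothetical. because the context is i.i.d. and revealed before acting, the value of a state averages, over contexts, the best action value in each context.

```python
from collections import Counter


def estimate_context_dist(samples, n_contexts):
    """Empirical context distribution from pre-execution oracle samples."""
    counts = Counter(samples)
    n = len(samples)
    return [counts[c] / n for c in range(n_contexts)]


def plan_with_estimated_contexts(R, P, ctx_dist, gamma, tol=1e-10, max_iter=100_000):
    """Value iteration with known R[s][c][a] and P[s][c][a][s'].

    Bellman update (assumed form for i.i.d. contexts seen before acting):
        V(s) = sum_c p(c) * max_a [ R(s,c,a) + gamma * sum_s' P(s'|s,c,a) V(s') ]
    """
    S, C, A = len(R), len(R[0]), len(R[0][0])
    V = [0.0] * S

    def q(s, c, a, values):
        # One-step lookahead value of action a in state s under context c.
        return R[s][c][a] + gamma * sum(P[s][c][a][t] * values[t] for t in range(S))

    for _ in range(max_iter):
        V_new = [
            sum(ctx_dist[c] * max(q(s, c, a, V) for a in range(A)) for c in range(C))
            for s in range(S)
        ]
        converged = max(abs(x - y) for x, y in zip(V_new, V)) < tol
        V = V_new
        if converged:
            break

    # Greedy policy: one action per (state, context) pair.
    policy = [[max(range(A), key=lambda a: q(s, c, a, V)) for c in range(C)]
              for s in range(S)]
    return V, policy
```

in this sketch only the first function consumes oracle samples, so its input length plays the role of n in the pair (n, m); no during-execution samples are used, which corresponds to m = 0.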

the results give tight upper and lower bounds on the number of samples required to reach a given accuracy with high probability. the analysis covers both the case where the agent can sample from the context distribution and the case where it can sample transitions conditioned on state-context-action tuples. these bounds quantify the fundamental difficulty of learning in environments with exogenous randomness and can guide algorithm design.
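"a given accuracy with high probability" is the usual pac requirement. a standard way to write it, with notation that is assumed here rather than taken from the paper (ε for accuracy, δ for failure probability, s_0 for a start state, \hat{\pi} for the learned policy, c_t for the context at step t), is:

```latex
\[
  \Pr\!\left[\, V^{\star}(s_0) - V^{\hat{\pi}}(s_0) \le \varepsilon \,\right] \ge 1 - \delta,
  \qquad
  V^{\pi}(s_0) = \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, c_t, a_t)
  \,\middle|\, s_0,\ a_t = \pi(s_t, c_t) \right].
\]
```

under this reading, a minimax bound gives the smallest sample budget (n, m) for which some learner meets the guarantee on every instance in the class. the paper may state its guarantee in a different norm or over a different start-state convention.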

why it matters: these bounds clarify the sample cost of learning in contextual environments, helping practitioners decide how to allocate data collection before and during deployment.
