behavior-aware corrections stabilize off-policy td learning

source: arxiv artificial intelligence: behavior-aware auxiliary corrections for off-policy temporal-difference prediction

level: research

temporal-difference learning with function approximation often becomes unstable when the data comes from a different policy than the one being evaluated. the tdc algorithm addresses this by adding an auxiliary correction based on feature covariances, and tdrc adds regularization so that it works in a single timescale. this paper examines what happens when the auxiliary matrix is replaced by the behavior policy's bellman matrix rather than the standard feature covariance matrix.
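for orientation, here is a minimal sketch of one linear tdc / tdrc update as these baselines are commonly written in the literature. it is not code from the paper: the function name `tdc_step` and the parameter names `eta` (ratio of auxiliary to primary step size) and `beta` (regularization strength) are illustrative assumptions.

```python
import numpy as np

def tdc_step(w, h, x, r, x_next, rho, gamma, alpha, eta=1.0, beta=0.0):
    """One linear TDC-style update; beta > 0 gives a TDRC-style update.

    w: primary weights (value estimate is w @ x)
    h: auxiliary weights (estimate of the expected TD error given x)
    rho: importance-sampling ratio pi(a|s) / b(a|s)
    Names and defaults are assumptions for illustration only.
    """
    delta = r + gamma * (w @ x_next) - (w @ x)   # TD error
    hx = h @ x                                   # auxiliary prediction
    # Primary update: TD step plus the gradient-correction term.
    w_new = w + alpha * rho * (delta * x - gamma * hx * x_next)
    # Auxiliary update: regress h toward the TD error; beta shrinks h (TDRC).
    h_new = h + eta * alpha * ((rho * delta - hx) * x - beta * h)
    return w_new, h_new

if __name__ == "__main__":
    w, h = np.zeros(2), np.array([1.0, 0.0])
    x, x_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    print(tdc_step(w, h, x, 1.0, x_next, rho=1.0, gamma=0.9, alpha=0.1, beta=1.0))
```

with `beta=0.0` and a separate `eta` this is a two-step-size tdc update; with `eta=1.0` and `beta > 0` the auxiliary weights share the primary step size and are shrunk toward zero, which is the regularized, single-timescale form described above.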

the authors propose two new methods: ba-tdc and ba-tdrc. ba-tdc swaps the auxiliary matrix in tdc for the behavior bellman matrix, while ba-tdrc applies the same swap to the regularized tdrc. by separating the geometry change from the regularization, the linear analysis shows how each part affects stability and convergence. the behavior-aware geometry can better reflect the actual data distribution, potentially reducing variance and bias in the value estimates.
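the summary does not give the paper's exact definitions, so the following is only a sketch under a stated assumption: that the "behavior bellman matrix" is the on-policy td matrix under the behavior policy, E_b[x (x - gamma x')^T], estimated from behavior-policy transitions without importance weights. the function name `auxiliary_matrices` is hypothetical. the sketch simply contrasts that matrix with the feature covariance matrix E[x x^T] used by standard tdc.

```python
import numpy as np

def auxiliary_matrices(X, X_next, gamma):
    """Estimate two candidate auxiliary matrices from sampled transitions.

    X, X_next: arrays of shape (n, d) holding features of s_t and s_{t+1},
    collected while following the behavior policy.
    Returns (C, A_b). The definition of A_b is an ASSUMPTION, not taken
    from the paper.
    """
    n = X.shape[0]
    C = X.T @ X / n                        # feature covariance, E[x x^T]
    A_b = X.T @ (X - gamma * X_next) / n   # assumed behavior Bellman matrix
    return C, A_b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    X_next = rng.normal(size=(1000, 3))
    C, A_b = auxiliary_matrices(X, X_next, gamma=0.9)
    print(np.round(C, 2))
    print(np.round(A_b, 2))
```

under this assumption the two matrices coincide when `gamma` is 0 and differ by the term gamma * E_b[x x'^T] otherwise, which is where information about the behavior policy's transitions would enter the auxiliary geometry.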

the work focuses on linear function approximation, which serves as a simple model for understanding more complex neural network value functions. the findings suggest that choosing the right auxiliary geometry matters for off-policy learning, and the behavior bellman matrix is a promising alternative. this could guide the design of auxiliary corrections in deep reinforcement learning, where feature covariances are harder to estimate reliably.

why it matters: better auxiliary corrections can make off-policy reinforcement learning more stable and data-efficient, which is crucial for training ai agents from diverse or historical data.
