behavior-induced mirror-prox td for faster off-policy learning

source: arxiv artificial intelligence: behavior-induced mirror-prox temporal-difference learning for faster off-policy prediction

level: research

gradient temporal-difference methods are stable for off-policy prediction with linear function approximation, but their speed depends on the metric used for updates. standard mirror-prox td methods rely on the feature covariance matrix, which may not capture the best geometry for learning. this paper introduces a behavior-induced mirror-prox td method called sthtd-mp. it replaces the covariance metric with the symmetric part of the behavior-policy bellman matrix, which encodes transition dynamics from the behavior policy.
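
to make the change of metric concrete, here is a minimal python sketch. it is not taken from the paper. it assumes the behavior-policy bellman matrix has the usual linear td form A = E[phi (phi - gamma * phi_next)^T], with the expectation taken over transitions generated by the behavior policy, and that "symmetric part" means (A + A^T) / 2. the helper names and the toy one-hot transitions are hypothetical. whether the resulting matrix is positive definite depends on conditions this summary does not state.

```python
def outer(u, v):
    """Outer product of two vectors, returned as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def mean_matrices(mats):
    """Entrywise average of a list of equally sized matrices."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / n for j in range(cols)]
            for i in range(rows)]


def symmetric_part(a):
    """Return (A + A^T) / 2 for a square matrix A."""
    n = len(a)
    return [[0.5 * (a[i][j] + a[j][i]) for j in range(n)] for i in range(n)]


def estimate_metrics(transitions, gamma):
    """Estimate two candidate metrics from behavior-policy samples.

    transitions: list of (phi, phi_next) feature pairs.
    Returns (feature covariance, symmetric part of the assumed
    Bellman matrix A = E[phi (phi - gamma * phi_next)^T]).
    """
    cov_terms, bellman_terms = [], []
    for phi, phi_next in transitions:
        td_direction = [p - gamma * q for p, q in zip(phi, phi_next)]
        cov_terms.append(outer(phi, phi))
        bellman_terms.append(outer(phi, td_direction))
    cov = mean_matrices(cov_terms)
    bellman = mean_matrices(bellman_terms)
    return cov, symmetric_part(bellman)


# Hypothetical toy data: one-hot features for a two-state chain.
transitions = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.0, 1.0], [0.0, 1.0]),
    ([1.0, 0.0], [1.0, 0.0]),
]
cov, sym = estimate_metrics(transitions, gamma=0.5)
print("feature covariance:", cov)
print("symmetric bellman part:", sym)
```

in this toy case the covariance is diagonal, while the symmetrized matrix picks up off-diagonal entries from the observed transitions. that is the kind of dynamics information the covariance metric cannot carry.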

sthtd-mp uses a single learning rate for both primal and auxiliary variables, simplifying tuning. it applies a mirror-prox prediction-correction step to a hybrid saddle-point operator that combines ideas from gradient td and hybrid td. the method is designed for fixed-policy linear prediction under standard stochastic assumptions. the paper provides a formal convergence analysis, showing that the new metric can lead to faster convergence by better aligning updates with the problem structure.
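
the prediction-correction idea can be illustrated on a toy problem. the sketch below is not sthtd-mp and does not use the paper's hybrid operator. it applies a generic mirror-prox (extragradient) step with a fixed diagonal metric to the bilinear saddle point L(x, y) = x * y, using one step size for both variables. the metric values, step size, and function names are assumptions chosen for illustration.

```python
import math


def saddle_operator(z):
    """Operator F(x, y) = (dL/dx, -dL/dy) for the toy problem L(x, y) = x * y."""
    x, y = z
    return (y, -x)


def precondition(g, metric_diag):
    """Apply the inverse of a diagonal metric to a direction."""
    return tuple(gi / mi for gi, mi in zip(g, metric_diag))


def step(z, direction, eta):
    """Move from z along -direction with a single shared step size eta."""
    return tuple(zi - eta * di for zi, di in zip(z, direction))


def mirror_prox_step(z, eta, metric_diag):
    """One prediction-correction update under a quadratic mirror map."""
    # Prediction: a provisional step using the operator at the current point.
    z_half = step(z, precondition(saddle_operator(z), metric_diag), eta)
    # Correction: restart from z, but use the operator evaluated at z_half.
    return step(z, precondition(saddle_operator(z_half), metric_diag), eta)


def plain_step(z, eta, metric_diag):
    """Baseline: a single preconditioned step with no correction."""
    return step(z, precondition(saddle_operator(z), metric_diag), eta)


def run(update, z0, eta, metric_diag, n_steps):
    """Apply an update rule n_steps times starting from z0."""
    z = z0
    for _ in range(n_steps):
        z = update(z, eta, metric_diag)
    return z


z0 = (1.0, 1.0)
metric_diag = (2.0, 0.5)  # hypothetical metric, not the paper's
z_mp = run(mirror_prox_step, z0, 0.1, metric_diag, 2000)
z_plain = run(plain_step, z0, 0.1, metric_diag, 2000)
print("mirror-prox distance to saddle:", math.hypot(*z_mp))
print("plain step distance to saddle:", math.hypot(*z_plain))
```

on this problem the plain update spirals away from the saddle point at the origin, while the prediction-correction update contracts toward it with the same step size. this is why the extra operator evaluation can be worth its cost on saddle-point formulations.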

the approach builds on recent work that suggests behavior-policy information can improve update geometry in off-policy learning. by using the bellman matrix instead of feature covariance, sthtd-mp aims to reduce variance and speed up learning without extra computational cost. the method is evaluated on standard off-policy prediction tasks, where it shows improved sample efficiency compared to existing mirror-prox td variants. this work contributes to making off-policy reinforcement learning more practical for real-world applications.

why it matters: faster off-policy learning can make reinforcement learning more data-efficient, reducing training time and cost in applications like robotics and recommendation systems.
