source: arxiv machine learning: policy regret for embedding model routing: contextual bandits with low-rank experts

level: research

modern recommendation systems often route queries among several embedding models, but the routing problem is poorly understood under realistic conditions such as adversarial queries and limited feedback. the paper frames embedding model routing as an adversarial contextual linear bandit with low-rank experts: contexts are queries, actions are items, and experts are embedding models that operate on low-rank latent spaces. the authors argue that standard regret measures fail in this setting, either through structural misspecification or for statistical reasons, which motivates the policy regret named in the title.
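
to make the setup concrete, here is a minimal python sketch of one round of the routing protocol. it is an illustration under assumptions, not the paper's formal model: the class `LowRankExpert`, the function `play_round`, the project-then-dot-product scoring rule, and the reward function are all hypothetical choices introduced here.

```python
def matvec(matrix, vector):
    """Multiply an (r x d) matrix, given as a list of rows, by a d-dim vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class LowRankExpert:
    """Hypothetical embedding model: maps a d-dim query into an r-dim
    latent space (r << d) and scores items by inner product there."""

    def __init__(self, projection, item_embeddings):
        self.projection = projection            # r x d matrix
        self.item_embeddings = item_embeddings  # one r-dim vector per item

    def recommend(self, query):
        latent = matvec(self.projection, query)
        scores = [dot(latent, item) for item in self.item_embeddings]
        # Return the index of the highest-scoring item (ties: lowest index).
        return max(range(len(scores)), key=scores.__getitem__)


def play_round(experts, route, query, reward_fn):
    """One round: route the query to an expert, play that expert's item,
    and observe the reward of that item only (bandit feedback)."""
    expert_index = route(query)
    item = experts[expert_index].recommend(query)
    reward = reward_fn(query, item)
    return expert_index, item, reward
```

in this sketch the learner never sees what the other experts' items would have earned, which is what makes the feedback "limited".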

the authors identify a log-quadratic policy class that is expressive enough for query-dependent routing yet structured for efficient online learning. they propose a policy gradient algorithm called hypentropy policy gradient (hpg). hpg adapts to unknown low-rank structure without needing prior knowledge of the rank. the algorithm achieves a regret bound that scales with the sum of the true rank and a log-determinant term, avoiding dependence on the ambient dimension.
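
as a rough illustration of what a query-dependent routing policy with a quadratic log-probability might look like, the sketch below assumes that each expert k has a weight matrix W_k, that the routing logit is x^T W_k x, and that the policy is a softmax over those logits. this reading of "log-quadratic" is an assumption, and the update shown is a plain score-function (reinforce-style) gradient step, not the hypentropy-based update of hpg, which this summary does not specify. the names `quad_form`, `routing_probs`, and `policy_gradient_step` are hypothetical.

```python
import math


def quad_form(W, x):
    """Compute x^T W x for a square matrix W given as a list of rows."""
    n = len(x)
    return sum(x[i] * W[i][j] * x[j] for i in range(n) for j in range(n))


def routing_probs(weights, x):
    """Softmax over per-expert quadratic logits (assumed policy form)."""
    logits = [quad_form(W, x) for W in weights]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(weights, x, chosen, reward, lr=0.1):
    """In-place score-function update using only the chosen expert's reward.

    For a softmax policy, d log pi(chosen | x) / d W_k
    equals (1[k == chosen] - pi_k(x)) * x x^T.
    """
    probs = routing_probs(weights, x)
    n = len(x)
    for k, W in enumerate(weights):
        coeff = lr * reward * ((1.0 if k == chosen else 0.0) - probs[k])
        for i in range(n):
            for j in range(n):
                W[i][j] += coeff * x[i] * x[j]
    return probs


# Usage: two experts, 2-dim queries, all-zero initial weights.
weights = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(2)]
query = [1.0, 0.0]
policy_gradient_step(weights, query, chosen=0, reward=1.0, lr=0.5)
updated = routing_probs(weights, query)  # expert 0 is now favored for this query
```

because the update is proportional to x x^T, a rewarded choice on one query leaves the routing probabilities for an orthogonal query unchanged, which is one simple way a policy can route differently depending on the query.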

the work provides theoretical guarantees for hpg under adversarial contexts and bandit feedback, showing that effective routing policies can be learned even when only the reward of the chosen action is observed. this addresses a gap in understanding how to dynamically select among multiple embedding models in production systems. the validation described is theoretical; the abstract does not detail empirical results.

why it matters: this research offers a principled way to route each query to the best embedding model in real time, which can improve recommendation accuracy and efficiency without manual tuning.

