Learning in Markovian bandits with non-observable states and constrained decision epochs

arXiv:2606.27448v1 Announce Type: new Abstract: This paper studies the problem of regret minimization in Markovian bandits with \emph{non-observable states} and possibly \emph{constrained} decision epochs. The focus is restricted to a ``pure'' regret benchmark, that compares the performance of the learning algorithm to the best \emph{pure policy} which -- akin to optimal policies of stochastic bandits -- picks the optimal arm from start to finish without ever switching. We introduce a generaliza...

arXiv cs.LG ·Thomas Hira, Victor Boone, Urtzi Ayesta, Ina Maria Verloop ·
compartilhar: