Author(s): Alexander Goldenshluger | Assaf Zeevi
Journal: Stochastic Systems
ISSN 1946-5238
Volume: 3
Issue: 1
Start page: 230
Date: 2013
Keywords: Sequential allocation | estimation | bandit problems | regret | minimax | rate–optimal policy
ABSTRACT
We consider a two-armed bandit problem which involves sequential sampling from two non-homogeneous populations. The response in each is determined by a random covariate vector and a vector of parameters whose values are not known a priori. The goal is to maximize cumulative expected reward. We study this problem in a minimax setting and develop rate-optimal policies that combine myopic action based on least squares estimates with a suitable "forced sampling" strategy. It is shown that the regret grows logarithmically in the time horizon $n$ and that no policy can achieve a slower growth rate over all feasible problem instances. In this setting of linear response bandits, the identity of the sub-optimal action changes with the values of the covariate vector, and the optimal policy is subject to sampling from the inferior population at a rate that grows like $\sqrt{n}$.
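For illustration only, the following minimal Python sketch captures the kind of policy the abstract describes: forced sampling on a sparse schedule (here, at perfect-square times, so the number of forced epochs up to horizon n grows like $\sqrt{n}$) combined with myopic actions based on least squares estimates. The dimension, noise model, constants, and schedule are our own assumptions for the sketch, not the paper's construction.

import math
import numpy as np

rng = np.random.default_rng(0)

d = 3                                             # covariate dimension (assumed)
beta = [rng.normal(size=d), rng.normal(size=d)]   # unknown arm parameters

def reward(arm, x):
    # Linear response plus Gaussian noise: y = x' beta_arm + eps.
    return x @ beta[arm] + rng.normal(scale=0.1)

n = 5000                                          # time horizon
X = [[], []]                                      # covariates observed per arm
Y = [[], []]                                      # rewards observed per arm
regret = 0.0

for t in range(1, n + 1):
    x = rng.normal(size=d)                        # fresh covariate vector
    if t <= 2:
        arm = t - 1                               # seed each arm once
    elif math.isqrt(t) ** 2 == t:
        # Forced sampling at perfect-square times: about sqrt(n) such
        # epochs up to horizon n (an illustrative schedule).
        arm = t % 2
    else:
        # Myopic step: play the arm with the larger estimated mean
        # response under the current least squares estimates.
        est = []
        for a in (0, 1):
            A = np.asarray(X[a])
            y = np.asarray(Y[a])
            b_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
            est.append(x @ b_hat)
        arm = int(np.argmax(est))
    X[arm].append(x)
    Y[arm].append(reward(arm, x))
    regret += max(x @ beta[0], x @ beta[1]) - x @ beta[arm]

print(f"cumulative regret after {n} rounds: {regret:.2f}")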