Linear Bandit
A linear bandit is a specific problem in reinforcement learning and multi-armed bandits, in which an agent repeatedly chooses among a set of actions (or arms) to maximize its cumulative reward. In the linear bandit setting, the expected reward of each action is modeled as a linear function of features associated with that action.
More formally, each action is represented by a feature vector, and the expected reward for choosing an action is the inner product of this feature vector with a parameter vector that is unknown to the agent and must be estimated from observed rewards. This relationship can be expressed as:
R(a) = θ · x(a)
where R(a) is the expected reward for action a, θ is the parameter vector, and x(a) is the feature vector associated with action a.
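As a small illustration of the formula, the sketch below evaluates θ · x(a) for three hypothetical actions and picks the one with the highest expected reward. The values of theta, the action names, and the feature vectors are made up for the example; in a real bandit problem θ is not known to the agent.

```python
def expected_reward(theta, x):
    """Return the inner product theta . x for two equal-length sequences."""
    if len(theta) != len(x):
        raise ValueError("theta and x must have the same dimension")
    return sum(t * xi for t, xi in zip(theta, x))


# Hypothetical parameter vector and feature vectors (illustration only).
theta = [0.5, -0.25, 1.0]
features = {
    "a1": [1.0, 0.0, 0.0],
    "a2": [0.0, 1.0, 1.0],
    "a3": [1.0, 1.0, 0.0],
}

# R(a) = theta . x(a) for every action, then the action with the largest value.
rewards = {a: expected_reward(theta, x) for a, x in features.items()}
best_action = max(rewards, key=rewards.get)
```

With these numbers, a2 has the highest expected reward (0.75), so an agent that knew θ would always choose it.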
The linear bandit model is particularly useful in scenarios where the relationship between features and rewards is approximately linear, allowing for efficient learning and decision-making. The agent learns the parameter vector θ by balancing exploration (trying different actions) and exploitation (choosing the actions that look best under current knowledge).
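One common way to balance exploration and exploitation in this setting is a LinUCB-style rule, which the text above does not name and which is used here only as an example. The sketch keeps a ridge-regression estimate of θ and adds an optimism bonus to actions whose features have been observed less often. The class name, the parameters alpha and ridge, the Gaussian noise level, and the simulated true_theta are all assumptions made for this illustration.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(M, v):
    return [dot(row, v) for row in M]


class LinUCB:
    """LinUCB-style learner: ridge estimate of theta plus an optimism bonus."""

    def __init__(self, dim, alpha=1.0, ridge=1.0):
        self.alpha = alpha  # assumed exploration weight
        # A_inv is the inverse of A = ridge * I + sum of x x^T over past pulls.
        self.A_inv = [[(1.0 / ridge if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim  # running sum of reward * x

    def theta_hat(self):
        # Ridge-regression estimate: A^{-1} b.
        return mat_vec(self.A_inv, self.b)

    def select(self, arms):
        theta = self.theta_hat()

        def score(x):
            # Estimated reward plus a bonus that shrinks as x gets explored.
            width = math.sqrt(max(dot(x, mat_vec(self.A_inv, x)), 0.0))
            return dot(theta, x) + self.alpha * width

        return max(range(len(arms)), key=lambda i: score(arms[i]))

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of A_inv after adding x x^T to A.
        Ax = mat_vec(self.A_inv, x)
        denom = 1.0 + dot(x, Ax)
        n = len(x)
        for i in range(n):
            for j in range(n):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


def simulate(rounds=2000, seed=0):
    rng = random.Random(seed)
    true_theta = [0.5, -0.25, 1.0]  # hypothetical; hidden from the learner
    arms = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0]]
    learner = LinUCB(dim=3, alpha=1.0)
    pulls = [0] * len(arms)
    for _ in range(rounds):
        i = learner.select(arms)
        # Observed reward = linear expected reward + assumed Gaussian noise.
        reward = dot(true_theta, arms[i]) + rng.gauss(0.0, 0.1)
        learner.update(arms[i], reward)
        pulls[i] += 1
    return learner.theta_hat(), pulls, arms


theta_hat, pulls, arms = simulate()
```

In this simulation the second arm has the highest expected reward, so the pull counts should concentrate on it as the bonus on the other arms shrinks. A larger alpha explores more; alpha = 0 reduces the rule to pure exploitation of the current estimate.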
Linear bandits are commonly applied in fields such as online advertising, recommender systems, and adaptive clinical trials, where the goal is to maximize user engagement or treatment effectiveness based on historical data.
In summary, linear bandits provide a framework for making sequential decisions under uncertainty, leveraging linear relationships to optimize rewards over time.