Linear Bandit
A linear bandit is a variant of the multi-armed bandit problem, studied within reinforcement learning, in which an agent repeatedly chooses among a set of actions (or arms) to maximize its cumulative reward. In the linear bandit setting, the expected reward of each action is modeled as a linear function of a feature vector associated with that action.
More formally, each action is represented by a feature vector, and the expected reward for choosing an action is the inner product of this feature vector and a parameter vector that is fixed but unknown to the agent. The reward the agent actually observes is typically this expected value plus random noise. The expected reward can be expressed as:
R(a) = θ · x(a)
where R(a) is the expected reward for action a, θ is the unknown parameter vector shared across all actions, and x(a) is the feature vector associated with action a.
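As a small illustration of the formula, the sketch below computes R(a) as an inner product and draws a noisy observation of it. The numeric values of theta, the three feature vectors, the function names, and the choice of Gaussian noise are all assumptions made for this example, not part of the definition.

```python
import random

def expected_reward(theta, x):
    """Return the inner product theta . x, i.e. R(a) for an action with features x."""
    return sum(t * xi for t, xi in zip(theta, x))

def observe_reward(theta, x, noise_std=0.1, rng=random):
    """Return a noisy observation of R(a); Gaussian noise is an assumption here."""
    return expected_reward(theta, x) + rng.gauss(0.0, noise_std)

# Hypothetical parameter vector and action features, chosen only for illustration.
theta = [0.5, -0.2, 1.0]
features = {
    "a1": [1.0, 0.0, 0.0],
    "a2": [0.0, 1.0, 1.0],
    "a3": [1.0, 1.0, 0.0],
}

# With theta known, the best action is simply the one with the largest inner product.
best_action = max(features, key=lambda a: expected_reward(theta, features[a]))
print(best_action, expected_reward(theta, features[best_action]))
```

In the actual problem the agent does not know theta, so it cannot compute this maximum directly and has to estimate theta from noisy observations.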
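The linear bandit model is particularly useful in scenarios where the relationship between features and rewards is approximately linear, allowing for efficient learning and decision-making. The agent estimates the unknown parameter vector θ from the rewards it observes, balancing exploration (trying actions whose rewards are still uncertain) and exploitation (choosing the actions that look best under the current estimate). Because θ is shared across actions, each observation also informs the estimated reward of every other action, which keeps learning efficient even when the number of actions is large.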
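One common way to balance exploration and exploitation is an upper-confidence-bound rule in the style of LinUCB, sketched below. This is an illustrative choice rather than the only approach. The class name, the values of alpha and reg, the simulated environment, and the Gaussian noise are assumptions for this example. The estimate of θ is a ridge-regression solution, and the inverse matrix is maintained with the Sherman-Morrison rank-one update so that only the standard library is needed.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mat_vec(M, v):
    return [dot(row, v) for row in M]

class LinUCB:
    """Minimal LinUCB-style agent (a sketch; alpha and reg are assumed defaults)."""

    def __init__(self, dim, alpha=1.0, reg=1.0):
        self.alpha = alpha  # width of the exploration bonus
        # Inverse of A = reg * I + sum of x x^T over the chosen actions.
        self.A_inv = [[(1.0 / reg if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim  # sum of reward * x over the chosen actions

    def theta_hat(self):
        """Ridge-regression estimate of theta: A^{-1} b."""
        return mat_vec(self.A_inv, self.b)

    def select(self, arms):
        """Pick the index with the highest estimated reward plus uncertainty bonus."""
        theta = self.theta_hat()

        def ucb(x):
            width = math.sqrt(max(dot(x, mat_vec(self.A_inv, x)), 0.0))
            return dot(theta, x) + self.alpha * width

        return max(range(len(arms)), key=lambda i: ucb(arms[i]))

    def update(self, x, reward):
        """Sherman-Morrison update of A^{-1} after adding x x^T, then update b."""
        Ax = mat_vec(self.A_inv, x)
        denom = 1.0 + dot(x, Ax)
        n = len(x)
        for i in range(n):
            for j in range(n):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
        for i in range(n):
            self.b[i] += reward * x[i]

def run(n_rounds=1000, seed=0):
    """Simulate the agent against a hypothetical environment with Gaussian noise."""
    rng = random.Random(seed)
    true_theta = [0.5, -0.2, 1.0]  # hidden from the agent
    arms = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0]]
    agent = LinUCB(dim=3)
    pulls = [0] * len(arms)
    for _ in range(n_rounds):
        i = agent.select(arms)
        reward = dot(true_theta, arms[i]) + rng.gauss(0.0, 0.1)
        agent.update(arms[i], reward)
        pulls[i] += 1
    return agent, arms, pulls

agent, arms, pulls = run()
print("pull counts per action:", pulls)
print("estimated theta:", agent.theta_hat())
```

The bonus term shrinks for feature directions that have been observed often, so the agent gradually shifts from exploring to exploiting. Note that the estimate of θ is only accurate along directions the agent has actually sampled, which is sufficient for choosing good actions.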
Linear bandits are applied in fields such as online advertising, recommendation systems, and adaptive clinical trials, where the goal is to maximize user engagement or treatment effectiveness using feedback gathered as decisions are made.
In summary, linear bandits provide a framework for making sequential decisions under uncertainty, leveraging linear relationships to optimize rewards over time.