Linear bandit
A linear bandit is a sequential decision problem in reinforcement learning. It is a structured variant of the multi-armed bandit in which an agent repeatedly chooses among a set of actions (or arms) to maximize its cumulative reward. In the linear bandit setting, the expected reward of each action is modeled as a linear function of a feature vector associated with that action.
More formally, each action is represented by a feature vector, and the expected reward for choosing an action is the inner product of this feature vector with a parameter vector that is unknown to the agent and must be learned from observed rewards. This relationship can be expressed as:
R(a) = θ · x(a)
where R(a) is the expected reward for action a, θ is the parameter vector, and x(a) is the feature vector associated with action a.
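As a small illustration of the formula above, the sketch below computes θ · x(a) for a handful of actions and picks the one with the highest expected reward. The values of theta and the feature vectors are hypothetical, chosen only for the example; in a real bandit problem θ is not known in advance.

```python
from typing import Dict, List


def expected_reward(theta: List[float], x: List[float]) -> float:
    """Return the inner product theta . x(a), i.e. R(a) in the text."""
    if len(theta) != len(x):
        raise ValueError("theta and x must have the same dimension")
    return sum(t * xi for t, xi in zip(theta, x))


# Hypothetical parameter vector (assumed known here, for illustration only).
theta = [0.5, -1.0, 2.0]

# Hypothetical feature vectors x(a) for three actions.
features: Dict[str, List[float]] = {
    "a1": [1.0, 0.0, 0.0],
    "a2": [0.0, 1.0, 1.0],
    "a3": [1.0, 1.0, 0.5],
}

# Expected reward of every action, and the action that maximizes it.
rewards = {a: expected_reward(theta, x) for a, x in features.items()}
best_action = max(rewards, key=rewards.get)
```

With these assumed numbers, action a2 has the highest expected reward (1.0, against 0.5 for a1 and a3), so an agent that knew θ would always choose it.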
The linear bandit model is particularly useful in scenarios where the relationship between features and rewards is approximately linear, allowing for efficient learning and decision-making. The agent estimates the unknown parameter vector θ by balancing exploration (trying different actions to gather information) and exploitation (choosing the actions that look best under its current estimate).
Linear bandits are commonly applied in fields such as online advertising, recommender systems, and adaptive clinical trials, where the goal is to maximize user engagement or treatment effectiveness based on historical data.
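One common way to balance exploration and exploitation is an upper-confidence-bound rule in the style of LinUCB. The text does not name a specific algorithm, so the following is only a minimal sketch under stated assumptions: a ridge-regression estimate of θ, an exploration weight alpha, Gaussian reward noise, and a hypothetical true_theta and arm set used to simulate the environment.

```python
import math
import random
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def mat_vec(m: List[Vector], v: Vector) -> Vector:
    return [dot(row, v) for row in m]


class LinUCB:
    """LinUCB-style agent with one shared parameter vector (a sketch)."""

    def __init__(self, dim: int, alpha: float = 1.0, reg: float = 1.0):
        self.alpha = alpha  # exploration weight (assumed value)
        # Inverse of A = reg * I + sum of x x^T over the actions played.
        self.A_inv = [[(1.0 / reg if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim  # sum of reward * x over the actions played

    def theta_hat(self) -> Vector:
        """Ridge-regression estimate of theta: A^{-1} b."""
        return mat_vec(self.A_inv, self.b)

    def select(self, arms: List[Vector]) -> int:
        """Pick the arm with the largest estimate plus exploration bonus."""
        theta = self.theta_hat()

        def ucb(x: Vector) -> float:
            width = math.sqrt(max(dot(x, mat_vec(self.A_inv, x)), 0.0))
            return dot(theta, x) + self.alpha * width

        return max(range(len(arms)), key=lambda i: ucb(arms[i]))

    def update(self, x: Vector, reward: float) -> None:
        """Sherman-Morrison rank-one update of A^{-1}, then update b."""
        ax = mat_vec(self.A_inv, x)
        denom = 1.0 + dot(x, ax)
        n = len(x)
        for i in range(n):
            for j in range(n):
                self.A_inv[i][j] -= ax[i] * ax[j] / denom
        for i in range(n):
            self.b[i] += reward * x[i]


def simulate(rounds: int = 2000, seed: int = 0) -> Vector:
    rng = random.Random(seed)
    true_theta = [1.0, -0.5]  # hypothetical, hidden from the agent
    arms = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # hypothetical x(a)
    agent = LinUCB(dim=2)
    for _ in range(rounds):
        i = agent.select(arms)
        # Observed reward: linear in the features plus Gaussian noise.
        reward = dot(true_theta, arms[i]) + rng.gauss(0.0, 0.1)
        agent.update(arms[i], reward)
    return agent.theta_hat()


theta_hat = simulate()
```

In this sketch the bonus term shrinks along directions of feature space that have been played often, so the agent explores less as its estimate becomes more certain. The estimate is typically more accurate along the directions of the arms it ends up choosing most, since rarely played arms contribute few observations.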
In summary, linear bandits provide a framework for making sequential decisions under uncertainty, leveraging linear relationships to optimize rewards over time.