AI Glossary: What Is an On-Policy Algorithm? Definition & Meaning

An on-policy algorithm is a type of reinforcement learning algorithm that learns and updates its policy from the actions the agent takes while following that same policy during training. In other words, the algorithm evaluates and improves the policy it is currently following, rather than learning from a separate or older policy.

In on-policy methods, the agent explores the environment and collects data by following its current policy, which is typically a probabilistic mapping from states to actions. The key property of on-policy algorithms is that the actions chosen by the current policy are the ones used to update and improve that same policy. This contrasts with off-policy algorithms, which learn from actions taken by a different policy and can therefore learn from past experience or from data generated by other strategies.
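As an illustration of a policy that is "a probabilistic mapping from states to actions", the sketch below builds an epsilon-greedy distribution over one state's action values and samples from it. Epsilon-greedy is only one possible choice and is an assumption here, as are the helper names `epsilon_greedy_probs` and `sample_action` and the example numbers.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Return action probabilities for an epsilon-greedy policy in one state.

    q_values: list of estimated action values for the current state.
    epsilon:  probability mass spread uniformly over all actions.
    """
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])  # greedy action (first on ties)
    probs = [epsilon / n] * n                        # uniform exploration share
    probs[best] += 1.0 - epsilon                     # remaining mass on the greedy action
    return probs


def sample_action(probs, rng):
    """Draw one action index according to the policy's probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical action values for a single state with three actions.
probs = epsilon_greedy_probs([0.0, 1.0, 0.5], epsilon=0.3)
action = sample_action(probs, random.Random(0))
```

In an on-policy method, the transitions produced by sampling from `probs` are the same ones used to update the values or parameters that define `probs`.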

Common examples of on-policy algorithms include SARSA (State-Action-Reward-State-Action) and policy gradient methods. In SARSA, the agent updates its Q-values using the action it actually takes next, which is determined by its current policy. In policy gradient methods, the agent optimizes the policy directly by adjusting its parameters based on the rewards received for actions taken under that policy.
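The SARSA update described above can be sketched for a tabular Q-function as follows. This is a minimal sketch, not a full training loop: the function names, the list-of-lists table `Q`, and the step size `alpha` and discount `gamma` values are assumptions for illustration. The Q-learning target is included only as a contrast, since Q-learning is a well-known off-policy method.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """Apply one tabular SARSA step in place and return the new Q[s][a].

    On-policy: the bootstrap term uses a_next, the action the current
    policy actually selected in s_next.
    """
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q[s][a]


def q_learning_target(Q, r, s_next, gamma=0.99):
    """Off-policy contrast: bootstrap from the greedy action in s_next,
    regardless of which action the behavior policy takes there."""
    return r + gamma * max(Q[s_next])


# Hypothetical two-state, two-action table.
Q = [[0.0, 0.0], [1.0, 2.0]]
new_value = sarsa_update(Q, s=0, a=0, r=1.0, s_next=1, a_next=0,
                         alpha=0.5, gamma=0.9)
off_policy_target = q_learning_target(Q, r=1.0, s_next=1, gamma=0.9)
```

With these numbers the SARSA target uses `Q[1][0]`, the value of the action that was taken, while the Q-learning target uses the larger `Q[1][1]`. That difference in the bootstrap term is what makes one update on-policy and the other off-policy.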

On-policy algorithms can be effective in environments where exploration and exploitation must be balanced carefully, because what they learn is tied directly to the agent’s current strategy, exploratory actions included. However, they tend to be less sample-efficient than off-policy methods: data collected under an earlier version of the policy generally cannot be reused once the policy has changed, so they do not benefit from past experience that could provide additional learning opportunities.