An on-policy algorithm is a reinforcement learning algorithm that evaluates and improves the same policy it uses to select actions during training. The data it learns from is generated by the current policy, rather than by a separate or older policy.
In on-policy methods, the agent explores the environment and collects data by following its current policy, which is typically a probabilistic mapping from states to actions. The defining feature is that the actions chosen by the current policy drive the updates to that same policy. This contrasts with off-policy algorithms, which can learn about one policy (the target policy) from actions taken by a different one (the behavior policy), and so can reuse past experience or data generated by other strategies.
Common examples of on-policy algorithms include SARSA (State-Action-Reward-State-Action) and many policy gradient methods, such as REINFORCE. In SARSA, the agent updates its Q-values using the action it actually takes in the next state, which is chosen by its current policy. In policy gradient methods, the agent optimizes the policy directly, adjusting the policy parameters in the direction that increases the expected reward of actions sampled from that policy.
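The SARSA update can be sketched in a few lines of Python. This is a minimal tabular illustration, not a reference implementation: the function names, the dictionary-based Q-table, and the default values of alpha (step size), gamma (discount factor) and epsilon are all assumptions made for the example. A Q-learning update is included only for contrast, since it bootstraps from the greedy action rather than from the action the policy actually chose.

```python
import random
from collections import defaultdict

def epsilon_greedy(q, state, actions, epsilon, rng):
    """Current policy (assumed here): random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])

def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, done=False):
    """On-policy: the target uses a_next, the action the current policy actually picked."""
    target = r if done else r + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (target - q[(s, a)])
    return q[(s, a)]

def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """Off-policy contrast: the target uses the greedy action, whatever was actually taken."""
    best_next = max(q[(s_next, b)] for b in actions)
    target = r if done else r + gamma * best_next
    q[(s, a)] += alpha * (target - q[(s, a)])
    return q[(s, a)]

if __name__ == "__main__":
    actions = ["left", "right"]
    rng = random.Random(0)
    q = defaultdict(float)
    q[("s1", "left")] = 5.0
    q[("s1", "right")] = 2.0
    # The action used in the SARSA target comes from the same policy being improved.
    a_next = epsilon_greedy(q, "s1", actions, epsilon=0.1, rng=rng)
    print(a_next, sarsa_update(q, "s0", "right", 1.0, "s1", a_next, alpha=0.5, gamma=0.9))
```

If the policy happens to pick "right" in state s1, SARSA bootstraps from that action's value (2.0) and moves Q(s0, right) to roughly 1.4, whereas the Q-learning update would bootstrap from the greedy value (5.0) and move it to 2.75. The difference in targets is what makes one update on-policy and the other off-policy.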
On-policy algorithms can be effective where exploration and exploitation must be balanced carefully, because the value estimates reflect the policy the agent actually follows, exploratory actions included. However, they tend to be less sample-efficient than off-policy methods: data collected under an earlier policy generally cannot be reused once the policy has changed, so each update needs fresh experience.
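The sample-efficiency cost shows up clearly in a small policy gradient loop. The sketch below applies a REINFORCE-style update with a softmax policy to a toy multi-armed bandit. The bandit, the function names, the batch-mean baseline and all hyperparameters are assumptions chosen for illustration. Each iteration samples a new batch from the current policy, uses it for one update, and then throws it away.

```python
import math
import random

def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def train_on_policy(arm_means, iterations=200, batch_size=20, lr=0.1, seed=0):
    """REINFORCE-style training on a toy bandit (hypothetical setup)."""
    rng = random.Random(seed)
    n = len(arm_means)
    prefs = [0.0] * n          # policy parameters: one preference per action
    samples_used = 0
    for _ in range(iterations):
        probs = softmax(prefs)
        # Collect a fresh batch by sampling actions from the CURRENT policy.
        batch = []
        for _ in range(batch_size):
            a = rng.choices(range(n), weights=probs)[0]
            r = rng.gauss(arm_means[a], 1.0)   # noisy reward, unit std dev
            batch.append((a, r))
        baseline = sum(r for _, r in batch) / batch_size
        # Gradient estimate: average of (r - baseline) * d log pi(a) / d prefs.
        grad = [0.0] * n
        for a, r in batch:
            for k in range(n):
                indicator = 1.0 if k == a else 0.0
                grad[k] += (r - baseline) * (indicator - probs[k]) / batch_size
        prefs = [p + lr * g for p, g in zip(prefs, grad)]
        # The batch is not kept: it was drawn from a policy that has now changed.
        samples_used += batch_size
    return softmax(prefs), samples_used

if __name__ == "__main__":
    probs, samples_used = train_on_policy([0.0, 1.0])
    print(probs, samples_used)
```

With these settings the loop consumes iterations × batch_size = 4,000 samples, each used for exactly one update. An off-policy method could instead keep such samples in a replay buffer and reuse them across many updates, which is where the efficiency gap comes from.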