An on-policy algorithm is a type of reinforcement learning algorithm that learns and updates its policy from the actions the agent takes under that same policy during training. In other words, the algorithm evaluates and improves the policy it is currently following, rather than learning from a separate or older policy.
In on-policy methods, the agent explores the environment and collects data by following its current policy, which is typically a probabilistic mapping from states to actions. The key aspect of on-policy algorithms is that the actions chosen by the current policy are the ones used to update and improve that same policy. This contrasts with off-policy algorithms, which learn from actions taken by a different policy and can therefore learn from past experience or from data generated by other strategies.
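As a rough illustration of this contrast, the sketch below compares the bootstrapped learning target of an on-policy update with that of an off-policy one. The function names, the toy Q-table, and the discount factor `gamma` are assumptions made for the example; Q-learning is used here only as a standard representative of the off-policy family.

```python
import numpy as np

def on_policy_target(Q, reward, next_state, next_action, gamma=0.5):
    """SARSA-style target: bootstraps from the action the current policy actually chose."""
    return float(reward + gamma * Q[next_state, next_action])

def off_policy_target(Q, reward, next_state, gamma=0.5):
    """Q-learning-style target: bootstraps from the greedy action, whatever was actually taken."""
    return float(reward + gamma * np.max(Q[next_state]))

# Hypothetical Q-table: 2 states (rows) x 2 actions (columns).
Q = np.array([[0.0, 0.0],
              [1.0, 5.0]])

# Suppose the current policy explored and picked action 0 in next state 1.
on_target = on_policy_target(Q, reward=1.0, next_state=1, next_action=0)
off_target = off_policy_target(Q, reward=1.0, next_state=1)

print(on_target)   # 1.0 + 0.5 * Q[1, 0] = 1.5
print(off_target)  # 1.0 + 0.5 * max(Q[1]) = 3.5
```

The on-policy target changes whenever the policy's action choice changes, which is why the data has to come from the policy being improved. The off-policy target ignores which action was actually taken next.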
Common examples of on-policy algorithms are SARSA (State-Action-Reward-State-Action) and policy gradient methods. In SARSA, the agent updates its Q-values based on the action it actually takes, which is determined by its current policy. In policy gradient methods, the agent optimizes the policy directly, adjusting its parameters based on the rewards received for actions taken under that policy.
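Both update rules can be sketched in a few lines. The code below is a minimal illustration, not a reference implementation: it assumes a tabular Q-function with an epsilon-greedy policy for SARSA, and a softmax policy with the basic REINFORCE update on a hypothetical two-armed bandit for the policy gradient case. All names, step sizes, and the toy reward are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, state, epsilon=0.1):
    """Current policy: mostly greedy with respect to Q, occasionally random."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """Move Q[s, a] toward r + gamma * Q[s_next, a_next].

    a_next must be the action the current policy chose in s_next.
    """
    td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * td_error
    return float(td_error)

def softmax(prefs):
    z = np.exp(prefs - np.max(prefs))
    return z / z.sum()

def reinforce_update(theta, action, ret, lr=0.1):
    """One REINFORCE step for a softmax policy over actions in a single state.

    Uses grad log pi(action) = onehot(action) - pi.
    """
    pi = softmax(theta)
    grad_log_pi = -pi
    grad_log_pi[action] += 1.0
    theta += lr * ret * grad_log_pi
    return theta

# SARSA: one transition (s=0, a=1, r=1.0, s_next=1) on an all-zero Q-table.
Q = np.zeros((2, 2))
a_next = epsilon_greedy(Q, state=1)          # sampled from the current policy
td_error = sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=a_next)

# REINFORCE: toy bandit where arm 1 pays 1.0 and arm 0 pays 0.0 (assumed).
theta = np.zeros(2)
for _ in range(500):
    action = int(rng.choice(2, p=softmax(theta)))  # act under the current policy
    ret = 1.0 if action == 1 else 0.0
    reinforce_update(theta, action, ret)

final_policy = softmax(theta)
```

In both cases the data used for the update (`a_next` for SARSA, `action` and `ret` for REINFORCE) is generated by the same policy that the update then changes, which is what makes them on-policy.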
Because the quantities they estimate reflect the policy the agent actually follows, exploratory actions included, on-policy algorithms can be quite effective in environments where exploration and exploitation need to be balanced carefully. However, they tend to be less sample-efficient than off-policy methods: experience collected under an earlier version of the policy generally cannot be reused once the policy has changed, so past data that could provide additional learning opportunities goes unused.