Off-Policy Evaluation (OPE) is a method used in reinforcement learning to estimate the performance of a given policy from data that was collected while following a different policy. In simpler terms, it allows researchers and practitioners to evaluate how well a new strategy might work without needing to deploy it in a live environment.
In reinforcement learning, a policy is a strategy that defines the actions an agent should take in different situations. However, collecting fresh data by running a new policy can be costly or risky, especially in real-world applications like healthcare or autonomous driving. OPE enables the use of historical data, which might have been gathered using an older or different policy, to infer how well a new policy would perform.
There are two main approaches to OPE: importance sampling and model-based evaluation. Importance sampling reweights the data collected under the old policy to account for the differences in behavior between the old and new policies. Each observed action is weighted by the ratio of its probability under the new policy to its probability under the old policy, so actions the new policy favors count for more. Model-based evaluation, on the other hand, involves fitting a model of the environment from the logged data and using it to simulate the performance of the new policy.
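To make the two approaches concrete, here is a minimal sketch in Python for the simplest possible setting: a single-step (bandit-style) problem with one context and two actions. Everything in it is an illustrative assumption rather than a reference implementation: the log format `(context, action, reward, behavior_prob)`, the function names, the tiny hand-written dataset, and the tabular reward model that stands in for a learned model of the environment.

```python
# Hypothetical logged data: (context, action, reward, behavior_prob), where
# behavior_prob is the probability the OLD policy gave to the logged action.
logs = [
    ("x", 0, 0.0, 0.5),
    ("x", 1, 1.0, 0.5),
    ("x", 1, 1.0, 0.5),
    ("x", 0, 0.0, 0.5),
]
ACTIONS = [0, 1]


def target_prob(context, action):
    """Hypothetical NEW policy: picks action 1 with probability 0.8."""
    return 0.8 if action == 1 else 0.2


def importance_sampling_estimate(logs, target_prob):
    """Average reward, reweighted by new-policy / old-policy probability."""
    total = 0.0
    for context, action, reward, behavior_prob in logs:
        weight = target_prob(context, action) / behavior_prob
        total += weight * reward
    return total / len(logs)


def model_based_estimate(logs, target_prob, actions):
    """Fit a tabular reward model, then score the new policy against it."""
    sums, counts = {}, {}
    for context, action, reward, _ in logs:
        key = (context, action)
        sums[key] = sums.get(key, 0.0) + reward
        counts[key] = counts.get(key, 0) + 1
    # Mean observed reward per (context, action); unseen pairs default to 0.0.
    model = {key: sums[key] / counts[key] for key in sums}

    total = 0.0
    for context, _, _, _ in logs:
        # Expected reward of the new policy in this context under the model.
        total += sum(
            target_prob(context, a) * model.get((context, a), 0.0)
            for a in actions
        )
    return total / len(logs)


is_value = importance_sampling_estimate(logs, target_prob)
dm_value = model_based_estimate(logs, target_prob, ACTIONS)
print(f"importance sampling: {is_value:.3f}")
print(f"model-based:         {dm_value:.3f}")
```

On this toy dataset both estimates come out at approximately 0.8, which matches the intuition that the new policy picks the rewarding action 80% of the time. On real data the two estimators usually disagree: importance sampling tends to have high variance when the old and new policies differ a lot, while the model-based estimate is only as good as the fitted model.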
OPE is particularly valuable because it helps decision-makers understand the potential impact of policy changes without experimenting in potentially harmful or costly ways. It plays a crucial role in various fields, including personalized recommendations, finance, and clinical trials, enabling safer and more efficient exploration of new strategies.