Ecuación de Bellman
La Ecuación de Bellman es un concepto clave en programación dinámica and aprendizaje por refuerzo that describes the relationship between the value of a decision problem at one point in time and the values at subsequent points in time. Named after Richard Bellman, who formulated it in the 1950s, this equation helps in breaking down complex convertir problemas en problemas secundarios más simples y manejables.
At its core, the Bellman Equation expresses the principle of optimality, which states that an política óptima has the property that whatever the initial state and initial decision are, the remaining decisions must be optimal with respect to the state resulting from the first decision. This can be mathematically represented in different forms, depending on the specific context, such as the value iteration or policy iteration in Markov Decision Processes (MDPs).
En una forma típica, la Ecuación de Bellman puede representarse como:
V(s) = max_a [ R(s, a) + γ ∑ P(s'|s, a) V(s') ]
Aquí, V(s) is the función de valor at state s, R(s, a) is the immediate reward received after taking action a in state s, P(s’|s, a) is the transition probability from state s to state s’ given action a, and γ is the discount factor that represents the difference in importance between future rewards and present rewards.
By solving the Bellman Equation, one can determine the optimal policy for decision-making processes, making it a foundational tool in various fields such as economics, operations research, and inteligencia artificial.