Um Finito Processo de Decisão de Markov (MDP) is a structured representation used in teoria da decisão and aprendizado por reforço to model scenarios where outcomes depend on both the decisions made by an agent and the stochastic nature of the environment. An MDP is defined by the following components:
- Estados (S): A finite set of states that represent all possible situations the agent can encounter.
- Ações (A): A finite set of actions available to the agent, which can influence the transition from one state para outro.
- Transição Probabilidade (P): A function that defines the probability of moving from one state to another given a specific action. This is often denoted as P(s’|s,a), the probability of reaching state s’ from state s by taking action a.
- Recompensas (R): A função de recompensa that assigns a numerical reward to each state or state-action pair, guiding the agent towards desirable outcomes.
- Fator de Desconto (γ): A factor between 0 and 1 that determines the present value of future rewards, allowing the agent to weigh immediate rewards more heavily than those received later.
Os MDPs são amplamente utilizados em várias áreas, incluindo inteligência artificial, robotics, economics, and operations research, as they provide a formal way to model sequential decision-making problems. The goal in an MDP is to find a policy—a mapping from states to actions—that maximizes the expected cumulative reward over time. Solving an MDP typically involves algorithms such as Value Iteration or Policy Iteration, which help identify the optimal policy for the agent.