A Finite Markov Decision Process (MDP) is a structured representation used in decision theory and reinforcement learning to model scenarios where outcomes depend on both the decisions made by an agent and the stochastic nature of the environment. An MDP is defined by the following components:
- States (S): A finite set of states that represent all possible situations the agent can encounter.
- Actions (A): A finite set of actions available to the agent, which can influence the transition from one state to another.
- Transition Probability (P): A function that defines the probability of moving from one state to another given a specific action. This is often denoted as P(s’|s,a), the probability of reaching state s’ from state s by taking action a.
- Rewards (R): A reward function that assigns a numerical reward to each state or state-action pair, guiding the agent towards desirable outcomes.
- Discount Factor (γ): A factor between 0 and 1 that determines the present value of future rewards, allowing the agent to weigh immediate rewards more heavily than those received later.
MDPs are widely used in various fields, including artificial intelligence, robotics, economics, and operations research, as they provide a formal way to model sequential decision-making problems. The goal in an MDP is to find a policy—a mapping from states to actions—that maximizes the expected cumulative reward over time. Solving an MDP typically involves algorithms such as Value Iteration or Policy Iteration, which help identify the optimal policy for the agent.