有限 マルコフ決定過程 (MDP) is a structured representation used in 意思決定理論 and 強化学習 to model scenarios where outcomes depend on both the decisions made by an agent and the stochastic nature of the environment. An MDP is defined by the following components:
- 状態(S): A finite set of states that represent all possible situations the agent can encounter.
- 行動(A): A finite set of actions available to the agent, which can influence the transition from one state もう一方へ。
- 遷移 確率 (P): A function that defines the probability of moving from one state to another given a specific action. This is often denoted as P(s’|s,a), the probability of reaching state s’ from state s by taking action a.
- 報酬(R): A 報酬関数 that assigns a numerical reward to each state or state-action pair, guiding the agent towards desirable outcomes.
- 割引率 (γ): A factor between 0 and 1 that determines the present value of future rewards, allowing the agent to weigh immediate rewards more heavily than those received later.
MDPはさまざまな分野で広く使用されており、特に 人工知能, robotics, economics, and operations research, as they provide a formal way to model sequential decision-making problems. The goal in an MDP is to find a policy—a mapping from states to actions—that maximizes the expected cumulative reward over time. Solving an MDP typically involves algorithms such as Value Iteration or Policy Iteration, which help identify the optimal policy for the agent.