
This slide recaps the Markov decision process (MDP): a formal way to describe an agent repeatedly making decisions in a stochastic environment. The key idea is the Markov property: given the current state and chosen action, the distribution over the next state and reward does not depend on any earlier history.
- States: the finite set of situations the agent can be in (what the agent “needs to know” to predict what happens next).
- Actions: the finite set of choices available to the agent in each state (often the same action set everywhere).
- Dynamics: how the world responds to an action, captured by the joint probability of the next state and reward given the current state and action.
- Discount factor: a number between 0 and 1 that weights future rewards relative to immediate ones (smaller values make the agent more short-sighted).
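The role of the discount factor can be made concrete with a short sketch: the return is the sum of rewards, each weighted by an increasing power of $\gamma$. The reward sequence and the value $\gamma = 0.9$ below are illustrative assumptions, not values from the slide.

```python
# Hypothetical reward sequence observed over four time steps.
rewards = [1.0, 0.0, 2.0, 1.0]
gamma = 0.9  # assumed discount factor for illustration

# Discounted return: G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)  # 1.0 + 0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```

With a smaller $\gamma$, the later rewards contribute less to $G$, which is exactly the short-sightedness mentioned above.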
$$
p(s',r\mid s,a) \doteq \Pr\{S_t = s',\, R_t = r \mid S_{t-1} = s,\, A_{t-1} = a\}
$$
Meaning of the symbols
- $p(s',r\mid s,a)$: dynamics model; probability of seeing next state and reward given current state and action.
- $s$: current state.
- $a$: action taken in the current state.
- $s'$: next state.
- $r$: reward observed on the transition.
- $S_{t-1}$: random variable for the state at time $t-1$.
- $A_{t-1}$: random variable for the action at time $t-1$.
- $S_t$: random variable for the state at time $t$.
- $R_t$: random variable for the reward at time $t$.
- $\gamma$: discount factor that weights future rewards.
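A minimal sketch of the dynamics function $p(s', r \mid s, a)$ as a lookup table. The two-state MDP, its state and action names, and all probabilities here are invented for illustration; the sketch just shows that for each $(s, a)$ the probabilities over $(s', r)$ pairs must sum to 1, and that the expected reward $r(s,a) = \sum_{s',r} r\, p(s', r \mid s, a)$ falls out of the same table.

```python
# Toy dynamics table for a hypothetical two-state MDP.
# Keys: (s, a); values: dict mapping (s', r) -> probability.
p = {
    ("s0", "go"): {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "go"): {("s0", 0.0): 1.0},
}

def is_normalized(p):
    # For every (s, a), the probabilities over (s', r) must sum to 1.
    return all(abs(sum(d.values()) - 1.0) < 1e-9 for d in p.values())

def expected_reward(p, s, a):
    # r(s, a) = sum over (s', r) of r * p(s', r | s, a)
    return sum(r * prob for (_, r), prob in p[(s, a)].items())

print(is_normalized(p))              # True
print(expected_reward(p, "s0", "go"))  # 0.8
```

Storing the dynamics as a joint distribution over $(s', r)$, rather than separate transition and reward tables, mirrors the four-argument form of $p$ in the equation above.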