
This slide recaps the Markov decision process (MDP): a formal way to describe an agent repeatedly making decisions in a stochastic environment. The key idea is the Markov property: given the current state and chosen action, the distribution over the next state and reward does not depend on any earlier history.
- States: the finite set of situations the agent can be in (what the agent “needs to know” to predict what happens next).
- Actions: the finite set of choices available to the agent in each state (often the same action set everywhere).
- Dynamics: how the world responds to an action, captured by the joint probability of the next state and reward given the current state and action.
- Discount factor: a number between 0 and 1 that weights future rewards relative to immediate ones (smaller values make the agent more short-sighted).
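The role of the discount factor can be made concrete with a short sketch: the return is the sum of rewards, each weighted by an increasing power of $\gamma$. The reward sequence and the value $\gamma = 0.9$ below are illustrative assumptions, not values from the slide.

```python
# Hypothetical reward sequence observed over four time steps.
rewards = [1.0, 0.0, 2.0, 1.0]
gamma = 0.9  # assumed discount factor for illustration

# Discounted return: G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)  # 1.0 + 0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```

With a smaller $\gamma$, the later rewards contribute less to $G$, which is exactly the short-sightedness mentioned above.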
$$
p(s',r\mid s,a) \doteq \Pr\{S_t = s',\, R_t = r \mid S_{t-1} = s,\, A_{t-1} = a\}
$$
Meaning of the symbols
- $p(s',r\mid s,a)$: dynamics model; probability of seeing next state and reward given current state and action.
- $s$: current state.
- $a$: action taken in the current state.
- $s'$: next state.
- $r$: reward observed on the transition.
- $S_{t-1}$: random variable for the state at time $t-1$.
- $A_{t-1}$: random variable for the action at time $t-1$.
- $S_t$: random variable for the state at time $t$.
- $R_t$: random variable for the reward at time $t$.
- $\gamma$: discount factor that weights future rewards.
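A minimal sketch of the dynamics function $p(s', r \mid s, a)$ as a lookup table. The two-state MDP, its state and action names, and all probabilities here are invented for illustration; the sketch just shows that for each $(s, a)$ the probabilities over $(s', r)$ pairs must sum to 1, and that the expected reward $r(s,a) = \sum_{s',r} r\, p(s', r \mid s, a)$ falls out of the same table.

```python
# Toy dynamics table for a hypothetical two-state MDP.
# Keys: (s, a); values: dict mapping (s', r) -> probability.
p = {
    ("s0", "go"): {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "go"): {("s0", 0.0): 1.0},
}

def is_normalized(p):
    # For every (s, a), the probabilities over (s', r) must sum to 1.
    return all(abs(sum(d.values()) - 1.0) < 1e-9 for d in p.values())

def expected_reward(p, s, a):
    # r(s, a) = sum over (s', r) of r * p(s', r | s, a)
    return sum(r * prob for (_, r), prob in p[(s, a)].items())

print(is_normalized(p))              # True
print(expected_reward(p, "s0", "go"))  # 0.8
```

Storing the dynamics as a joint distribution over $(s', r)$, rather than separate transition and reward tables, mirrors the four-argument form of $p$ in the equation above.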