Bellman equation (part 2)
October 17, 2023
Definitions and Equations #
https://chat.openai.com/share/e71eecee-4263-4dc6-842e-a0dee3af28b0
This is part 2, following Bellman equation (part 1).
Definition of \( G_t \) #
\( G_t \) is the random variable representing the cumulative discounted reward (the return) from time \( t \) onward. It can be expressed as:
\[\begin{aligned} G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \\ G_t &= R_{t+1} + \gamma G_{t+1} \end{aligned} \]
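To see the recursion concretely, here is a minimal sketch that checks \( G_t = R_{t+1} + \gamma G_{t+1} \) numerically. The reward sequence and \( \gamma \) are made-up values, and the infinite sum is truncated at the end of the list, so this is only an illustration:

```python
# Toy check of the return recursion G_t = R_{t+1} + gamma * G_{t+1}.
# Rewards and gamma are hypothetical; the series is truncated, so this
# only approximates the infinite-horizon return.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0, 0.5]  # R_1, R_2, ..., R_5

def G(t):
    """G_t = sum_{k >= 0} gamma^k * R_{t+1+k}, truncated at the list end."""
    return sum(gamma**k * r for k, r in enumerate(rewards[t:]))

# The recursive form agrees with the direct sum.
assert abs(G(0) - (rewards[0] + gamma * G(1))) < 1e-12
print(G(0), rewards[0] + gamma * G(1))
```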
Definition of \( V(s) \) #
It is the expected return (cumulative discounted reward) given state \( s \). \[ V(s) = \mathbb{E}[G_t | S_t = s] \] Combining the above definitions, we get: \[ V(s) = \mathbb{E}[R_{t+1} + \gamma G_{t+1} | S_t = s] \]
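As a sanity check on the definition, \( V(s) \) can be estimated by averaging sampled returns. A minimal Monte Carlo sketch on a hypothetical two-state Markov reward process (all numbers are made up; episodes are truncated at a fixed horizon, which is a close approximation since \( \gamma < 1 \)):

```python
import random

# Hypothetical 2-state Markov reward process; all numbers are made up.
# From state s we move to s' with probability P[s][s'] and receive the
# reward R[s'] attached to the state we land in.
P = {0: [0.7, 0.3], 1: [0.4, 0.6]}  # transition probabilities P[s][s']
R = {0: 0.0, 1: 1.0}                # reward for entering state s'
gamma = 0.9

def sample_return(s, horizon=200):
    """One truncated sample of G_t starting from S_t = s."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        s = random.choices([0, 1], weights=P[s])[0]
        g += discount * R[s]
        discount *= gamma
    return g

n = 20_000
estimate = sum(sample_return(0) for _ in range(n)) / n
print(f"Monte Carlo estimate of V(0): {estimate:.3f}")
```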
The \( Q \) Function #
Conditioning further on the action \( A_t = a \) gives the \( Q \) function: \[ Q(s, a) = \mathbb{E}[R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a] \]
Law of Total Expectation #
Using the law of total expectation (averaging over actions drawn from the policy \( \pi \)): \[ V(s) = \sum_{a} \pi(a|s) Q(s, a) \] \[ V(s) = \sum_{a} \pi(a|s) \mathbb{E}[R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a] \]
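Concretely, this step is just a policy-weighted average of \( Q \) values. A tiny sketch with a hypothetical \( Q \)-table and policy for one state:

```python
# One state s with two actions; Q values and policy are hypothetical.
Q = {"left": 2.0, "right": 5.0}     # Q(s, a)
pi = {"left": 0.25, "right": 0.75}  # pi(a | s)

# Law of total expectation: V(s) = sum_a pi(a|s) * Q(s, a)
V = sum(pi[a] * Q[a] for a in Q)
print(V)  # 0.25 * 2.0 + 0.75 * 5.0 = 4.25
```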
Stationarity of \( G \) #
Because the horizon is infinite and the process is stationary (time-homogeneous), the conditional distribution of the return does not depend on \( t \). In particular,
\[ \mathbb{E}[G_{t+1} | S_{t+1} = s'] = \mathbb{E}[G_t | S_t = s'] = V(s') \]
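Making the step this licenses explicit: condition the inner expectation on \( S_{t+1} \) (tower property), use the Markov property to drop \( S_t, A_t \), and then apply the stationarity above:
\[ \begin{aligned} \mathbb{E}[G_{t+1} | S_t = s, A_t = a] &= \mathbb{E}\big[\mathbb{E}[G_{t+1} | S_{t+1}] \,\big|\, S_t = s, A_t = a\big] \\ &= \mathbb{E}[V(S_{t+1}) | S_t = s, A_t = a] \end{aligned} \]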
Therefore, the relation can be expressed as:
\[ \begin{aligned} Q(s, a) &= \mathbb{E}[R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a] \\ Q(s, a) &= \mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) | S_t = s, A_t = a] \\ Q(s, a) &= \sum_{s', r} p(s', r | s, a) [r + \gamma V(s')] \end{aligned} \]
Resulting in the Bellman equation: \[ V(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma V(s')] \]
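The last line is a fixed-point equation in \( V \), which suggests solving it by repeated substitution. A minimal iterative policy-evaluation sketch on a hypothetical two-state, two-action MDP (the transition probabilities, rewards, and policy are all made up for illustration):

```python
# Iterative policy evaluation: repeatedly apply
#   V(s) <- sum_a pi(a|s) * sum_{s', r} p(s', r | s, a) * (r + gamma * V(s'))
# on a hypothetical 2-state MDP until V stops changing.
gamma = 0.9
states, actions = [0, 1], ["stay", "move"]

# p[s][a] is a list of (prob, next_state, reward) triples; made-up numbers.
p = {
    0: {"stay": [(1.0, 0, 0.0)], "move": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "move": [(1.0, 0, 0.0)]},
}
pi = {0: {"stay": 0.5, "move": 0.5}, 1: {"stay": 0.9, "move": 0.1}}

V = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        v = sum(
            pi[s][a] * sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[s][a])
            for a in actions
        )
        delta = max(delta, abs(v - V[s]))
        V[s] = v
    if delta < 1e-10:
        break
print(V)  # converges to the unique fixed point of the Bellman equation
```

Note the update here is done in place (each sweep reuses values already updated in that sweep); this variant also converges for \( \gamma < 1 \).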