To use value iteration, we need to quantitatively define rewards/penalties for our choices
Value Iteration (VI): decision making
Value (utility): the value of a state represents the minimum (deterministic) / min. expected (stochastic) cost to reach the goal from that state
Plan: series of actions
Policy: a mapping from all states to actions
Optimal Policy: a policy which achieves some goal (whilst accumulating the least (deterministic) / min. expected(stochastic) cost)
Deterministic: no randomness in the outcome
Stochastic: outcome governed by a probability distribution
Argmin (argument of the minimum): element associated with / responsible for the minimum
Sweep: exhaustive visit to all states
<aside> 📌 SUMMARY: Value iteration requires us to quantitatively define rewards/penalties for our actions
</aside>
Initialize a random $V(s)$ for our state table
We continuously calculate the value from taking the best action and updating the state table until convergence.
We just need an approximate policy
For our application, $\gamma = 1$ (can ignore)
Algorithm parameter: a small threshold $\theta>0$ determining accuracy of estimation
Initialize $V(s)$, for all $s\in S^+$, arbitrarily except that $V(\text{terminal}) = 0$
Loop:
$\Delta \leftarrow 0$
Loop for each $s\in S:$
$v \leftarrow V(s)$
$V(s) \leftarrow \text{max}_a[r+\gamma V(s')]$
$\Delta \leftarrow \text{max}(\Delta, |v-V(s)|)$
until $\Delta < \theta$
Output a deterministic policy, $\pi\approx\pi*$, s.t. $\pi(s) = \text{argmax}_a[r+\gamma V(s')]$
<aside> 💡 We only need to get an approximation as it may not be worth it to calculate until convergence
</aside>
Initialize $V(s)$, for all $s\in S^+$, arbitrarily except that $V(\text{terminal}) = 0$
Instead of considering the reward per step, we consider the costs (thus focusing on the min instead)
Do until convergence:
for all states: $s\\in(s_1,s_2,...s_n)$
For all states $s \in (s_1, s_2,...s_n):$ $\pi(s) \leftarrow \text{argmin}_\text{action}[\text{cost}+V(s')]$
<aside> 💡 Convergence occurs when we compare the state tables before and after and see that there are no changes
</aside>
<aside> 📌 SUMMARY: To obtain our optimal policy, we can keep calculating a $V(s)$ until we hit convergence, then take the argmin of all the actions in a state
</aside>
<aside> 📌 SUMMARY: We aim for a convergence of the utility grid whenever we do an update. A single update may cause more updates in neighboring cells
</aside>