Date: March 15, 2024

Topic: Core Ideas in Value Iterations

Recall

To use value iteration, we need to quantitatively define rewards/penalties for our choices

Notes

Core Ideas in Value Iterations

Value Iteration (VI): decision making

Decisions have both:
- Immediate consequences
- Long-term effects
If we can quantitatively define rewards/penalties for each and every aspect associated with our choices
- We can use VI to objectively determine the most optimal decisions possible
- For e.g., assigning points to our choices

Terminology

Value (utility): the value of a state represents the minimum (deterministic) / min. expected (stochastic) cost to reach the goal from that state

Indicates the usefulness of being in a particular state

Plan: series of actions

Policy: a mapping from all states to actions

Optimal Policy: a policy which achieves some goal (whilst accumulating the least (deterministic) / min. expected(stochastic) cost)

Deterministic: no randomness in the outcome

Stochastic: outcome governed by a probability distribution

Argmin (argument of the minimum): element associated with / responsible for the minimum

Sweep: exhaustive visit to all states

<aside> 📌 SUMMARY: Value iteration requires us to quantitatively define rewards/penalties for our actions

</aside>

Date: March 17, 2024

Topic: Value Iteration Algorithm

Recall

Initialize a random $V(s)$ for our state table

We continuously calculate the value from taking the best action and updating the state table until convergence.

We just need an approximate policy

For our application, $\gamma = 1$ (can ignore)

Notes

Value Iteration Algorithm

Iterative algorithm that calculates the true utility of each state by continually improving estimates until convergence
Upon convergence, the utility is the minimum (det.) / min. expected (sto.) cost to reach the goal
An optimal policy can be obtained from the converged utility grid

Generalized Form

Algorithm parameter: a small threshold $\theta>0$ determining accuracy of estimation

Initialize $V(s)$, for all $s\in S^+$, arbitrarily except that $V(\text{terminal}) = 0$

Loop:

$\Delta \leftarrow 0$

Loop for each $s\in S:$

$v \leftarrow V(s)$

$V(s) \leftarrow \text{max}_a[r+\gamma V(s')]$

$\Delta \leftarrow \text{max}(\Delta, |v-V(s)|)$

until $\Delta < \theta$

Output a deterministic policy, $\pi\approx\pi*$, s.t. $\pi(s) = \text{argmax}_a[r+\gamma V(s')]$

<aside> 💡 We only need to get an approximation as it may not be worth it to calculate until convergence

</aside>

Simpler Notation (for this course)

Initialize $V(s)$, for all $s\in S^+$, arbitrarily except that $V(\text{terminal}) = 0$

Instead of considering the reward per step, we consider the costs (thus focusing on the min instead)

Do until convergence:

   for all states: $s\\in(s_1,s_2,...s_n)$

For all states $s \in (s_1, s_2,...s_n):$ $\pi(s) \leftarrow \text{argmin}_\text{action}[\text{cost}+V(s')]$

<aside> 💡 Convergence occurs when we compare the state tables before and after and see that there are no changes

</aside>

<aside> 📌 SUMMARY: To obtain our optimal policy, we can keep calculating a $V(s)$ until we hit convergence, then take the argmin of all the actions in a state

</aside>

Date: March 17, 2024

Topic: Example Problem

<aside> 📌 SUMMARY: We aim for a convergence of the utility grid whenever we do an update. A single update may cause more updates in neighboring cells

</aside>