ISE 537 Spring 2025, GPP
Homework 5 (100 pts in total)
Please submit Python Notebooks or PDF files that contain your Python code.
1. (Value Iteration Algorithm for the Student Dilemma.) Consider the so-called Student Dilemma. It is a Markov decision process with 7 states, of which states 5, 6, and 7 are terminal (observe that no arrows leave these states in the picture). All transitions are Markov and are labeled on the picture. For example, in state 1 we can choose to either work or rest; if we rest, then with probability 0.5 we transition to state 2, and with probability 0.5 we stay in state 1. The rewards marked in red are collected for as long as we remain in the corresponding state. For example, we collect reward 0 while in state 1, regardless of whether we rest or work. Consider an infinite-horizon decision-making problem (without discounting, i.e. γ = 1) that runs until one of the terminal states is reached. The goal is to find the policy that maximizes the expected sum of rewards collected before reaching a terminal state.
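A natural way to set this problem up in Python is to store the transition probabilities, per-state rewards, and terminal set in plain dictionaries. Only state 1's rest transitions and its reward are stated explicitly in the text above, so everything else below is a placeholder to be filled in from the picture (a sketch, not the actual Student Dilemma data):

```python
# Hypothetical encoding of the MDP described above. Only the (1, "rest")
# transitions and state 1's reward are given in the text; the remaining
# entries must be read off the picture.
P = {
    (1, "rest"): {1: 0.5, 2: 0.5},  # stated in the problem
    (1, "work"): {},                # to be filled in from the picture
    # ... transitions for states 2, 3, 4 ...
}
r = {1: 0.0}          # state 1's red reward is 0; add states 2 through 7
TERMINAL = {5, 6, 7}  # terminal states per the problem statement

# sanity check: each filled-in action's transition probabilities sum to 1
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in P.values() if d)
print("encoding OK")
```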
Recall the Bellman equation: for an undiscounted episodic MDP, the optimal value function satisfies V*(s) = r(s) + max_a Σ_{s'} P(s' | s, a) V*(s') at non-terminal states s, with V*(s) = r(s) at terminal states.
(a) The optimal value functions of states 1, 2, ··· , 7 satisfy:
Question: Please explain how to fill in the blanks?
(b) We also covered in class the Value Iteration Algorithm, which computes a numerical solution of (1):
i. Initialization at iteration 0: let V^(0) : S → R be an arbitrary function
ii. Given V^(i) at iteration i, we compute V^(i+1)(s) = r(s) + max_a Σ_{s'} P(s' | s, a) V^(i)(s')
iii. Terminate when V^(i) stops improving; e.g., when max_s |V^(i+1)(s) − V^(i)(s)| is small
iv. Return the greedy policy
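Steps i-iv above can be sketched as follows. The transition table and rewards here are a hypothetical toy MDP, not the actual Student Dilemma data (which must be read off the picture); the loop itself follows the four steps directly, using the convention that a terminal state's value equals its reward:

```python
# Value iteration on an undiscounted episodic MDP (gamma = 1).
# P, r, and TERMINAL below are PLACEHOLDERS for illustration only;
# replace them with the numbers from the Student Dilemma picture.
P = {
    (1, "rest"): [(0.5, 1), (0.5, 2)],  # stated in the problem
    (1, "work"): [(1.0, 3)],            # placeholder transition
    (2, "rest"): [(1.0, 4)],            # placeholder transition
    (3, "work"): [(1.0, 4)],            # placeholder transition
}
r = {1: 0.0, 2: -1.0, 3: -1.0, 4: 10.0}  # placeholder per-state rewards
TERMINAL = {4}                           # placeholder terminal set

def value_iteration(P, r, terminal, tol=1e-2):
    states = sorted(r)
    V = {s: 0.0 for s in states}          # step i: arbitrary initialization
    n_iter = 0
    while True:
        n_iter += 1
        V_new = {}
        for s in states:
            if s in terminal:
                V_new[s] = r[s]           # convention: terminal value = reward
                continue
            actions = [a for (s0, a) in P if s0 == s]
            # step ii: Bellman update V(s) = r(s) + max_a E[V(s')]
            V_new[s] = r[s] + max(
                sum(p * V[s2] for p, s2 in P[(s, a)]) for a in actions
            )
        # step iii: stop when the sup-norm change drops below tol
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            V = V_new
            break
        V = V_new
    # step iv: read off the greedy policy from the converged values
    policy = {
        s: max(
            [a for (s0, a) in P if s0 == s],
            key=lambda a: sum(p * V[s2] for p, s2 in P[(s, a)]),
        )
        for s in states if s not in terminal
    }
    return V, policy, n_iter

V, policy, n_iter = value_iteration(P, r, TERMINAL)
print(V, policy, n_iter)
```

With the real transition probabilities and rewards substituted in, the returned `n_iter`, `V`, and `policy` answer sub-questions i, ii, and iii respectively.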
Question: Please implement the algorithm to solve (1) (ideally with a Python notebook). Stop when max_s |V^(i+1)(s) − V^(i)(s)| drops below 10^{-2}.
i. How many iterations does it take to reach this error level?
ii. Please report your estimated values of V (1), ··· , V (7).
iii. Describe the output greedy policy.