In this note we present two examples of compact-action finite-state Markov decision chains in which a policy improvement procedure yields incorrect or limited results. In the first example, which has a multichain structure, the average rewards of the successive policies do not converge to the maximal value. In the second example, which has a unichain structure, the maximizing policy is not unique at each step of the algorithm, so neither the bias vectors nor the maximizing policies converge. Accordingly, no solution to the average optimality equations can be obtained in this way.
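For orientation, the sketch below shows standard policy iteration for a unichain model with finitely many actions. It is not the counterexample construction from the paper. Policy evaluation solves g + h(i) = r(i, f(i)) + Σ_j p(j | i, f(i)) h(j) with the normalization h(0) = 0, where g is the average reward and h the bias vector. Policy improvement then maximizes r(i, a) + Σ_j p(j | i, a) h(j), which is the right-hand side of the average optimality equations. The function names, the tolerance and the two-state example data are hypothetical choices made for illustration. The tie-breaking rule keeps the current action whenever it is among the maximizers. With finite action sets, that rule is what guarantees termination. With compact action sets, the non-uniqueness of maximizers is exactly the difficulty the abstract describes.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def evaluate(policy, P, R):
    """Solve g + h(i) = r(i, f(i)) + sum_j P h(j) with h(0) = 0 (unichain assumed)."""
    n = len(policy)
    rows, rhs = [], []
    for i, a in enumerate(policy):
        # Unknowns are ordered [g, h(0), ..., h(n-1)].
        rows.append([1.0] + [(1.0 if j == i else 0.0) - P[i][a][j] for j in range(n)])
        rhs.append(R[i][a])
    rows.append([0.0, 1.0] + [0.0] * (n - 1))  # normalization h(0) = 0
    rhs.append(0.0)
    x = solve_linear(rows, rhs)
    return x[0], x[1:]


def improve(policy, h, P, R, tol=1e-9):
    """Greedy step on r(i, a) + sum_j p(j | i, a) h(j); keep the current action on ties."""
    new = []
    for i, a_cur in enumerate(policy):
        def q(a):
            return R[i][a] + sum(p * hj for p, hj in zip(P[i][a], h))
        best = max(q(a) for a in range(len(R[i])))
        new.append(a_cur if q(a_cur) >= best - tol else max(range(len(R[i])), key=q))
    return new


def policy_iteration(P, R, policy=None, max_iter=100):
    """Alternate evaluation and improvement until the policy stops changing."""
    policy = list(policy) if policy is not None else [0] * len(R)
    for _ in range(max_iter):
        g, h = evaluate(policy, P, R)
        new = improve(policy, h, P, R)
        if new == policy:
            return g, h, policy
        policy = new
    raise RuntimeError("policy iteration did not terminate within max_iter")


# Hypothetical two-state, two-action model; every transition row is strictly
# positive, so each stationary policy yields an irreducible (unichain) chain.
P = [[[0.9, 0.1], [0.2, 0.8]],   # P[state][action] = distribution over next states
     [[0.5, 0.5], [0.9, 0.1]]]
R = [[1.0, 0.0],                 # R[state][action] = one-step reward
     [2.0, 3.0]]

g, h, policy = policy_iteration(P, R)
print(policy, round(g, 4))  # -> [1, 1] 1.4118
```

Starting from the policy [0, 0], the sketch terminates after one improvement step at the gain-optimal policy, with gain 24/17. If the action sets were compact and the maximizers not unique, this argument would break down.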

doi.org/10.1080/15326348708807061, hdl.handle.net/1765/2250
Stochastic Models
Erasmus School of Economics

Dekker, R. (1987). Counter examples for compact action Markov decision chains with average reward criteria. Stochastic Models, 357–368. doi:10.1080/15326348708807061