Counter examples for compact action Markov decision chains with average reward criteria
In this note we present two examples of compact-action finite-state Markov decision chains in which a policy improvement procedure yields incorrect or incomplete results. In the first example, which exhibits a multichain structure, the average rewards of the successive policies fail to converge to the maximal value. In the second example, which has a unichain structure, maximizing policies are not unique at each step of the algorithm, so neither the bias vectors nor the maximizing policies converge. Accordingly, no solution to the average optimality equations can be obtained.
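For orientation, the policy improvement (Howard) procedure under the average-reward criterion can be sketched for the benign finite-state, finite-action unichain case, where it does terminate: alternate between evaluating the current policy (solving for its gain g and bias h) and improving it greedily against h. This is only an illustrative sketch of the standard algorithm whose compact-action analogue the paper's examples defeat; the function name and the two-state example below are hypothetical and not taken from the paper.

```python
import numpy as np

def policy_iteration_avg(P, r, max_iter=100):
    """Howard's policy iteration for a finite unichain MDP, average-reward
    criterion.  P[a, s, s'] are transition probabilities, r[s, a] one-step
    rewards.  Returns (policy, gain g, bias h with h[0] = 0)."""
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)          # arbitrary initial policy
    for _ in range(max_iter):
        # --- evaluation: solve g + h[s] - (P_pi h)[s] = r_pi[s], h[0] = 0 ---
        P_pi = P[pi, np.arange(n_states), :]    # row s is P(s, pi(s), .)
        r_pi = r[np.arange(n_states), pi]
        A = np.zeros((n_states, n_states))
        A[:, 0] = 1.0                           # coefficient of the gain g
        A[:, 1:] = np.eye(n_states)[:, 1:] - P_pi[:, 1:]
        sol = np.linalg.solve(A, r_pi)
        g, h = sol[0], np.concatenate(([0.0], sol[1:]))
        # --- improvement: maximize r(s,a) + sum_{s'} P(s,a,s') h(s') ---
        q = r + np.einsum('asj,j->sa', P, h)
        pi_new = np.argmax(q, axis=1)
        if np.array_equal(pi_new, pi):          # no strict improvement left
            return pi, g, h
        pi = pi_new
    return pi, g, h

# Hypothetical two-state, two-action example: action a moves toward state a
# with probability 0.9, so every stationary policy is unichain.
P = np.array([[[0.9, 0.1], [0.9, 0.1]],
              [[0.1, 0.9], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi, g, h = policy_iteration_avg(P, r)   # optimal policy takes action 1 everywhere
```

In the finite case the policy and bias stabilize after finitely many iterations; the paper's counterexamples show that with a compact action set this iteration can fail to drive the gain to its maximal value (multichain case) or to settle on a single policy and bias (unichain case).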
Keywords: Markov Decision Chains, stochastic models
Persistent URL: dx.doi.org/10.1080/15326348708807061, hdl.handle.net/1765/2250
Dekker, R. (1987). Counter examples for compact action Markov decision chains with average reward criteria. Stochastic Models, 357–368. doi:10.1080/15326348708807061