Reputation: 23
For example, I have tried running TD(λ) on a random MDP, and I noticed that I got different policies depending on the value of λ. Can TD(1) and TD(0) give different optimal policies?
Update: Increasing my initial value function gave me the same result for both cases.
Upvotes: 0
Views: 182
Reputation: 6679
Yes. In general, RL methods with convergence guarantees are only guaranteed to converge to *an* optimal policy, not to a particular one. So, if an MDP has several optimal policies, different algorithms (including policy iteration methods), or the same algorithm with different settings, can converge to different optimal policies.
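Here is a minimal sketch of that point (not your MDP, and using value iteration rather than TD, purely to illustrate the tie-breaking issue): a hypothetical 3-state MDP in which two actions from the start state are exactly equally good, so it has two optimal policies, and which one you get depends only on how ties are broken.

```python
import numpy as np

# Hypothetical MDP: from state 0, actions 0 and 1 lead to different absorbing
# states but yield identical reward, so both greedy policies are optimal.
gamma = 0.9
n_states, n_actions = 3, 2

# P[a, s, s'] = transition probability, R[a, s] = expected reward (assumed values).
P = np.zeros((n_actions, n_states, n_states))
R = np.zeros((n_actions, n_states))

P[0, 0, 1] = 1.0; R[0, 0] = 1.0   # action 0: state 0 -> state 1, reward 1
P[1, 0, 2] = 1.0; R[1, 0] = 1.0   # action 1: state 0 -> state 2, reward 1
P[:, 1, 1] = 1.0                  # states 1 and 2 are absorbing with reward 0,
P[:, 2, 2] = 1.0                  # so their values are identical

def value_iteration(P, R, gamma, iters=500):
    """Return the action-value table Q[a, s] after running value iteration."""
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = R + gamma * np.einsum('asx,x->as', P, V)  # one-step lookahead
        V = Q.max(axis=0)
    return Q

Q = value_iteration(P, R, gamma)
print("Q-values at state 0:", Q[:, 0])                 # identical for both actions
print("greedy, ties broken toward action 0:", Q[:, 0].argmax())
print("greedy, ties broken toward action 1:", 1 - Q[::-1, 0].argmax())
```

Both printed policies are optimal; they just pick different actions at the tie. A TD-based method adds sampling noise and initial-value effects on top of this, which is consistent with your update: changing the initialization changed which of the (equally good) policies you ended up with.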
Upvotes: 1