Prashant Pandey

Reputation: 4682

Stochastic state transitions in MDP: How does Q-learning estimate that?

I am implementing Q-learning on a grid-world to find the optimal policy. One thing that is bugging me is that the state transitions are stochastic. For example, if I am in the state (3,2) and take the action 'north', I would land at (3,1) with probability 0.8, at (2,2) with probability 0.1, and at (4,2) with probability 0.1. How do I fit this information into the algorithm? As I have read so far, Q-learning is a "model-free" learning method: it does not need to know the state-transition probabilities. I am not convinced as to how the algorithm will automatically find these transition probabilities during the training process. If someone can clear things up, I shall be grateful.

Upvotes: 4

Views: 1316

Answers (1)

Nick Walker

Reputation: 808

Let's look at what Q-learning guarantees to see why it handles transition probabilities.

Let's call q* the optimal action-value function. This is the function that returns the correct value of taking some action in some state. The value of a state-action pair is the expected cumulative reward of taking the action, then following an optimal policy afterwards. An optimal policy is simply a way of choosing actions that achieves the maximum possible expected cumulative reward. Once we have q*, it's easy to find an optimal policy: from each state s that you find yourself in, greedily choose the action that maximizes q*(s,a). Q-learning learns q*, provided it visits every state and action infinitely many times.
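To make that concrete, here is a minimal sketch of the tabular Q-learning update with epsilon-greedy exploration. The `env` object with its `reset()`/`step()` interface and the hyperparameter values are just illustrative assumptions, not anything specific to your grid-world:

```python
import random
from collections import defaultdict

# Tabular Q-learning sketch. `env` is assumed to expose reset() -> state and
# step(action) -> (next_state, reward, done); states and actions are hashable.
def q_learning(env, actions, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    q = defaultdict(float)  # q[(state, action)], initialized to 0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy action choice keeps every state-action pair visited.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            next_state, reward, done = env.step(action)
            # The target uses only the sampled next_state; no transition
            # probabilities appear anywhere in the update.
            target = reward + gamma * max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
    return q
```

Notice that the update only ever sees the next state the environment actually produced, which is exactly why no model of the transition probabilities is needed.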

For example, if I am in the state (3,2) and take the action 'north', I would land at (3,1) with probability 0.8, at (2,2) with probability 0.1, and at (4,2) with probability 0.1. How do I fit this information into the algorithm?

Because the algorithm visits every state and action infinitely many times and averages the sampled targets, it learns the expected value of attempting to go north. We go north so many times that the value converges to the sum of each possible outcome weighted by its transition probability. Say we knew all of the values on the gridworld except for the value of going north from (3,2), and assume there is no reward for any transition from (3,2). After sampling north from (3,2) infinitely many times, the algorithm converges to the value 0.8 * q(3,1) + 0.1 * q(2,2) + 0.1 * q(4,2). With this value in hand, a greedy action choice from (3,2) will be properly informed of the true expected value of attempting to go north. The transition probability gets baked right into the value!
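You can watch this happen with a quick simulation. The successor-state values below are made-up numbers, and for simplicity there is no reward and no discounting; the estimate is a running average of sampled targets, which is the Q-learning update with a learning rate of 1/n:

```python
import random

# Hypothetical known values of the three possible successor states.
v = {(3, 1): 10.0, (2, 2): 2.0, (4, 2): 4.0}
outcomes, probs = list(v.keys()), [0.8, 0.1, 0.1]

q_north = 0.0
for n in range(1, 100001):
    next_state = random.choices(outcomes, probs)[0]  # stochastic 'north' transition
    target = v[next_state]                 # no reward, gamma = 1 for simplicity
    q_north += (1.0 / n) * (target - q_north)  # running average of sampled targets

print(q_north)  # approaches 0.8*10 + 0.1*2 + 0.1*4 = 8.6
```

The estimate settles near the probability-weighted sum even though the code never sees the numbers 0.8, 0.1, 0.1 inside the update itself; they only influence which samples show up.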

Upvotes: 4
