Learning

Reputation: 15

Why does Q-learning work in an unknown environment?

Q-learning uses the instant reward matrix R to model the environment. That means it uses a known matrix R for learning, so why do people say "Q-learning can work in an unknown environment"?

Upvotes: 2

Views: 1590

Answers (1)

Mikhail Korobov

Reputation: 22248

Q-Learning is an algorithm for finding a policy that selects optimal actions in a Markov Decision Process (MDP). An environment is defined not only by the rewards but also by the state-transition probabilities. An MDP doesn't require the rewards to be a fixed matrix: the reward can be any function.

If the state-transition probabilities and the rewards of an MDP are known for all states and actions, then the optimal policy can be found using dynamic programming techniques, so you don't need reinforcement learning for that.
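
For illustration, here is a minimal value-iteration sketch (one of the dynamic programming techniques mentioned above; the specific array shapes and function name are my own assumptions, not part of the original answer). It only works because the full model, P and R, is known in advance:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """P[s, a, s'] = known transition probabilities, R[s, a] = known rewards."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: requires the complete model of the MDP.
        Q = R + gamma * P @ V           # shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal values and greedy policy
        V = V_new
```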

Unlike dynamic programming techniques, Q-Learning works if the rewards and the state-transition probabilities are unknown: that is, you only see a reward value after taking an action.

Q-learning doesn't use an instant reward matrix R; it only requires that after taking action a in state s, the agent observes the next state s' and a reward value r.
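
A minimal tabular Q-learning sketch to illustrate this (the `env` object with `reset()`, `step()`, and `actions()` is a hypothetical interface assumed here for illustration; it is not from the original answer). Note that the agent never touches a reward matrix or transition probabilities, only the observed (s, a, r, s') samples:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)              # Q[(s, a)]; no model of R or P is stored
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: Q[(s, act)])
            # The agent only observes the outcome of its action:
            # the next state s2 and the reward r. The environment stays unknown.
            s2, r, done = env.step(a)
            target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in env.actions(s2))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```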

Upvotes: 2
