Reputation: 184
I've been studying reinforcement learning, and understand the concepts of value/policy iteration, TD(1)/TD(0)/TD(Lambda), and Q-learning. What I don't understand is why Q-learning can't be used for everything. Why do we need "deep" reinforcement learning as described in DeepMind's DQN paper?
Upvotes: 3
Views: 985
Reputation: 194
Q-learning stores its Q-values in a Q-table and selects an action for the current state by looking up the corresponding Q-values.
But this is not always feasible. When the state space is large, the Q-table becomes very large, each Q-value estimate takes a long time to be updated, and most entries may be updated only a handful of times, so they remain inaccurate.
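For concreteness, here is a minimal sketch of what that table and its update look like in Python (names such as `Q`, `alpha`, and `update` are illustrative, not taken from any paper):

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.99            # learning rate and discount factor
Q = defaultdict(float)              # Q[(state, action)] -> estimated return

def update(state, action, reward, next_state, actions):
    # One tabular Q-learning update:
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```

With a large state space, most of those `(state, action)` entries are rarely visited, which is exactly the problem described above.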
To tackle this kind of problem, we use a function approximator to learn general Q-values. Neural networks are good function approximators, so DQN was proposed to take the state representation as input and estimate the Q-values. The network learns to predict Q-values from low-level features of the state, which helps it generalize across states it has never seen exactly.
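A rough sketch of that function-approximation idea, using a small PyTorch network in place of the table (the layer sizes and names here are illustrative assumptions, not the DQN architecture):

```python
import torch
import torch.nn as nn

state_dim, n_actions = 8, 4                 # example sizes, not fixed by DQN
q_net = nn.Sequential(                      # maps a state vector to one Q-value per action
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)

state = torch.randn(1, state_dim)           # a dummy state vector
q_values = q_net(state)                     # shape (1, n_actions)
greedy_action = q_values.argmax(dim=1)      # pick the action with the highest Q-value
```

The network's weights are shared across all states, so an update for one state also improves the estimates for similar states.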
Upvotes: 3
Reputation: 2312
Q-learning is a model-free reinforcement learning method first documented in 1989. It is “model-free” in the sense that the agent does not attempt to model its environment. It arrives at a policy based on a Q-table, which stores the estimated value of taking each action from a given state. When the agent is in state s, it looks up that state in the Q-table and picks the action with the highest associated reward estimate. For the agent to arrive at an optimal policy, it must balance exploring all available actions for all states with exploiting what the Q-table says is the optimal action for a given state. If the agent always picks a random action, it will never arrive at an optimal policy; likewise, if it always chooses the action with the highest estimated reward, it may arrive at a sub-optimal policy, since certain state-action pairs may not have been explored enough.
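One common way to strike that balance is epsilon-greedy action selection; a small sketch (assuming `Q` is a table keyed by (state, action) pairs, and the function name is illustrative):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # With probability epsilon, explore by taking a random action;
    # otherwise exploit the action with the highest estimated Q-value.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```

Epsilon is often decayed over time so the agent explores heavily early on and exploits more as its estimates improve.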
Given enough time, Q-learning can eventually find an optimal policy π for any finite Markov decision process (MDP). For a simple game of Tic-Tac-Toe, the number of distinct game states is fewer than 6,000. That might sound like a lot, but consider a simple video game environment in OpenAI’s Gym known as “Lunar Lander”.
The goal is to use the lander’s thrusters to bring it down between the yellow flags, slowing its inertia enough that it does not crash. The possible actions are: do nothing, fire the left thruster, fire the right thruster, fire the main center thruster. Firing the main thruster incurs a small negative reward. Landing without crashing provides a large reward, and landing between the flags provides a further large reward; crashing provides a large negative reward. The agent experiences state as a combination of the following parameters: the x and y coordinates of the lander, its x and y velocity, its rotation and angular velocity, and a binary value for each leg indicating whether it is touching the ground. Consider all the possible states the agent could encounter from different combinations of these parameters; the state space of this MDP is enormous compared to Tic-Tac-Toe. It would take an inordinate amount of time for the agent to experience enough episodes to reliably pilot the lander. The state space of the Lunar Lander environment is too large for traditional Q-learning to solve in a reasonable amount of time, but with some adjustments (in the form of “deep” Q-learning) an agent can indeed learn to navigate the environment successfully and consistently within a reasonable amount of time.
As detailed in the DeepMind paper you linked to, Deep Q-learning is based on Tesauro’s TD-Gammon approach, which approximates the value function from information received as the agent interacts with the environment. One major difference is that instead of updating the value function only from the most recent transition, the agent’s experiences are stored in a fixed-size replay memory: as new transitions are pushed in, the oldest ones are discarded, and learning updates are performed on minibatches sampled at random from this memory. This helps the algorithm explore the environment more efficiently because it tends to prevent feedback loops. It is also more efficient because learning only from pairs of consecutive states can lead to inaccuracies, since those states are strongly correlated; random sampling breaks up those correlations. This reuse of stored past transitions is referred to as “experience replay.”
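A minimal sketch of such a replay memory (the class and names here are illustrative, not the DeepMind implementation):

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        # Fixed-size memory: once full, the oldest transitions fall off the left.
        self.memory = deque(maxlen=capacity)

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size=32):
        # Random minibatches break the correlation between consecutive states.
        return random.sample(self.memory, batch_size)
```

During training, the agent stores each transition it experiences with `push`, and the Q-network is updated on batches drawn with `sample` rather than on the raw stream of consecutive steps.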
TL;DR: When the state-action space is so large that regular Q-learning would take too long to converge, deep reinforcement learning may be a viable alternative because of its use of function approximation.
Upvotes: 4