Reputation: 897
My task involves a large grid-world type of environment (grid sizes range from 30x30 and 50x50 up to 100x100, and at the largest 200x200). Each element in this grid contains either a 0 or a 1, randomly initialized at the start of each episode. My goal is to train an agent that starts in a random position on the grid, navigates to every cell with the value 1, and sets it to 0. (Note that in general the grid is mostly 0s, with sparse 1s.)
I am trying to train a DQN model with 5 actions to accomplish this task:
1) Move up
2) Move right
3) Move down
4) Move left
5) Clear (sets current element to 0)
The "state" that I give the model is the current grid (NxM tensor). I provide the agent's current location through the concatenation of a flattened one-hot (1x(N*N)) tensor to the output of my convolutional feature vector (before the FC layers).
However, I find that the epsilon-greedy exploration policy does not lead to sufficient exploration. Early in training (when the model is essentially choosing random actions anyway), the pseudo-random action combinations tend to "cancel out", so my agent never moves far enough from its starting location to discover, for example, a cell with value 1 in a different quadrant of the grid. I do get a converging policy on a 5x5 grid with a non-convolutional MLP model, so I think my implementation is sound.
1) How might I encourage exploration that does not "cancel out" and confine the agent to a small region around its starting location?
2) Is this approach a good way to accomplish this task (assuming I want to use RL)?
3) I would think that working with a "continuous" action space (where the model outputs the indices of the "1" elements) would make convergence harder to achieve. Is it wise to always try to use discrete action spaces?
Upvotes: 0
Views: 879
Reputation: 5412
Exploration is one of the big challenges in RL. However, your problem does not seem too hard for a simple ε-greedy policy, especially if you have an initial random state.

First, some tricks that you can use: decay ε with the episode steps and reset it for the next episode, or start with a large ε and decrease it with the learning iteration (a minimal sketch of both schedules is at the end of this answer).

Regarding your questions:
1) The above tricks should address this. There are also methods that explicitly encourage visiting unexplored regions of the state space, such as "intrinsic motivation" and "curiosity". This is a nice paper about it. (A rough count-based sketch is given at the end of this answer.)
2) Your problem is fully discrete and not that big, so value (or policy) iteration (which is just dynamic programming) would work better.
3) It depends on your problem. Is the discretization accurate enough to let you act optimally? If so, go for it (but usually this is not the case for harder problems).
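In case it helps, here is a rough illustration of the two ε schedules mentioned above (function names and hyperparameter values are just placeholders, not a prescription):

```python
import numpy as np

def epsilon_per_episode(step_in_episode, eps_start=1.0, eps_end=0.05, decay_steps=500):
    """Trick 1: decay epsilon within an episode, then reset it at the next episode."""
    frac = min(step_in_episode / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_per_iteration(learning_iteration, eps_start=1.0, eps_end=0.05, decay_rate=1e-4):
    """Trick 2: start with a large epsilon and decay it across learning iterations."""
    return eps_end + (eps_start - eps_end) * np.exp(-decay_rate * learning_iteration)

# Usage sketch inside the training loop (greedy_action is whatever picks argmax_a Q(s, a)):
# eps = epsilon_per_iteration(total_updates)
# action = np.random.randint(n_actions) if np.random.rand() < eps else greedy_action(state)
```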
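And a very rough sketch of the count-based flavour of intrinsic motivation, added as a bonus on top of the environment reward (this is only the general idea, not the method from the paper; the names and the beta value are arbitrary):

```python
from collections import defaultdict
import math

class CountBonus:
    """Exploration bonus that shrinks as a (discretized) state is revisited."""
    def __init__(self, beta=0.1):
        self.counts = defaultdict(int)
        self.beta = beta

    def bonus(self, state_key):
        # state_key could be e.g. the agent's (row, col) position on the grid
        self.counts[state_key] += 1
        return self.beta / math.sqrt(self.counts[state_key])

# Usage sketch: reward += bonus_tracker.bonus((agent_row, agent_col))
```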
Upvotes: 1