Reputation: 897
My task involves a large grid-world type of environment (grid sizes range from 30x30 and 50x50 up to 100x100, and at the largest 200x200). Each element in this grid contains either a 0 or a 1, randomly initialized at the start of each episode. My goal is to train an agent that starts in a random position on the grid, navigates to every cell with the value 1, and sets it to 0. (Note that in general the grid is mostly 0s, with sparse 1s.)
I am trying to train a DQN model with 5 actions to accomplish this task:
1) Move up
2) Move right
3) Move down
4) Move left
5) Clear (sets current element to 0)
The "state" that I give the model is the current grid (NxM tensor). I provide the agent's current location through the concatenation of a flattened one-hot (1x(N*N)) tensor to the output of my convolutional feature vector (before the FC layers).
However, I find that the epsilon-greedy exploration policy does not lead to sufficient exploration. Early in training (when the model is essentially choosing random actions anyway), the pseudo-random action combinations tend to "cancel out", so my agent never moves far enough from its starting location to discover, for example, a cell with value 1 in a different quadrant of the grid. I do get a converging policy on a 5x5 grid with a non-convolutional MLP model, so I think my implementation is sound.
1) How might I encourage exploration that does not "cancel out" and confine the agent to a small region around its starting location?
2) Is this approach a good way to accomplish this task (assuming I want to use RL)?
3) I would think that working with a "continuous" action space (where the model outputs the indices of the "1" elements) would make convergence harder to achieve. Is it wise to always try to use discrete action spaces?
Upvotes: 0
Views: 879
Reputation: 5412
Exploration is one of the big challenges in RL. However, your problem does not seem too hard for a simple ε-greedy policy, especially if you have an initial random state.

First, some tricks that you can use: decay ε with the episode steps and reset it for the next episode, or start with a large ε and decrease it with the learning iteration (a minimal sketch of both schedules is at the end of this answer).

Regarding your questions:
1) The above tricks should address this. There are also methods that explicitly encourage visiting unexplored regions of the state space, such as "intrinsic motivation" and "curiosity". This is a nice paper about it. (A rough count-based sketch is given at the end of this answer.)
2) Your problem is fully discrete and not that big, so value (or policy) iteration (which is just dynamic programming) would work better.
3) It depends on your problem. Is the discretization accurate enough to let you act optimally? If so, go for it (but usually this is not the case for harder problems).
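In case it helps, here is a rough illustration of the two ε schedules mentioned above (function names and hyperparameter values are just placeholders, not a prescription):

```python
import numpy as np

def epsilon_per_episode(step_in_episode, eps_start=1.0, eps_end=0.05, decay_steps=500):
    """Trick 1: decay epsilon within an episode, then reset it at the next episode."""
    frac = min(step_in_episode / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_per_iteration(learning_iteration, eps_start=1.0, eps_end=0.05, decay_rate=1e-4):
    """Trick 2: start with a large epsilon and decay it across learning iterations."""
    return eps_end + (eps_start - eps_end) * np.exp(-decay_rate * learning_iteration)

# Usage sketch inside the training loop (greedy_action is whatever picks argmax_a Q(s, a)):
# eps = epsilon_per_iteration(total_updates)
# action = np.random.randint(n_actions) if np.random.rand() < eps else greedy_action(state)
```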
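And a very rough sketch of the count-based flavour of intrinsic motivation, added as a bonus on top of the environment reward (this is only the general idea, not the method from the paper; the names and the beta value are arbitrary):

```python
from collections import defaultdict
import math

class CountBonus:
    """Exploration bonus that shrinks as a (discretized) state is revisited."""
    def __init__(self, beta=0.1):
        self.counts = defaultdict(int)
        self.beta = beta

    def bonus(self, state_key):
        # state_key could be e.g. the agent's (row, col) position on the grid
        self.counts[state_key] += 1
        return self.beta / math.sqrt(self.counts[state_key])

# Usage sketch: reward += bonus_tracker.bonus((agent_row, agent_col))
```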
Upvotes: 1