coar

Reputation: 15

PPO constantly learns to do nothing in a grid-world setting

I am trying to solve a custom grid-world environment with PPO. The grid is 5x5 and one of the cells is the depot, where the agent starts. Over time, items appear stochastically on the grid and remain there for 15 time steps before disappearing again. The goal of the agent is to collect as many items as possible and bring them to the depot. The agent has a capacity of one (i.e., it can only carry one item at a time) and can choose between the actions up, down, left, right, or doing nothing (picking up and dropping off are not separate actions, as these things happen automatically when the agent is on an item cell or the depot). For a successful item pickup and drop-off it receives a total reward of +15 (split into +7.5 for picking up and +7.5 for dropping off an item). Each step without a pickup or drop-off yields -1, except when the chosen action is to do nothing, in which case the agent receives a reward of 0.
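For context, here is a simplified sketch of the step/reward logic. This is not the exact code from the pastebin; the function name, state layout, and the assumption that the movement penalty is replaced by the pickup/drop-off reward on those steps are just illustrative:

```python
GRID = 5
PICKUP_REWARD = 7.5
DROPOFF_REWARD = 7.5
STEP_PENALTY = -1.0

# action indices: 0 up, 1 down, 2 left, 3 right, 4 do nothing
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1), 4: (0, 0)}

def step(state, action):
    """state: dict with 'agent' and 'depot' (row, col) tuples, 'items' ({pos: time_left}), 'carrying'."""
    dr, dc = MOVES[action]
    r = min(max(state["agent"][0] + dr, 0), GRID - 1)
    c = min(max(state["agent"][1] + dc, 0), GRID - 1)
    state["agent"] = (r, c)

    if not state["carrying"] and state["agent"] in state["items"]:
        del state["items"][state["agent"]]      # pickup happens automatically
        state["carrying"] = True
        reward = PICKUP_REWARD                  # +7.5
    elif state["carrying"] and state["agent"] == state["depot"]:
        state["carrying"] = False               # drop-off happens automatically
        reward = DROPOFF_REWARD                 # +7.5
    elif action == 4:
        reward = 0.0                            # doing nothing costs nothing
    else:
        reward = STEP_PENALTY                   # any other step without pickup/drop-off: -1

    # items age out after their 15-step lifetime; stochastic spawning of new items is omitted here
    state["items"] = {p: t - 1 for p, t in state["items"].items() if t > 1}
    return state, reward
```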

I chose to represent features such as the agent and item positions with a vector of size 25 (i.e., one entry per cell that is 1, or some other relevant number, if an item/agent is there and 0 otherwise). My observation space thus consists of the following: free capacity, agent position, item positions, remaining time, target location, Manhattan distances to the items and the target, remaining time of and distance to the closest item, distance from the closest item to the target, and distance to the walls.
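Roughly, the observation is assembled like this. Again, this is only a sketch with made-up names; the full version in the pastebin also appends the distance, closest-item, and wall features:

```python
import numpy as np

def one_hot_cell(pos, grid=5):
    vec = np.zeros(grid * grid, dtype=np.float32)
    vec[pos[0] * grid + pos[1]] = 1.0
    return vec

def build_observation(state, grid=5):
    agent_vec = one_hot_cell(state["agent"], grid)      # 25 entries: 1 at the agent's cell
    depot_vec = one_hot_cell(state["depot"], grid)      # 25 entries: 1 at the depot (target) cell
    item_vec = np.zeros(grid * grid, dtype=np.float32)  # 25 entries: remaining time per item cell
    for (r, c), time_left in state["items"].items():
        item_vec[r * grid + c] = time_left
    free_capacity = np.array([0.0 if state["carrying"] else 1.0], dtype=np.float32)
    # the full observation also includes Manhattan distances, closest-item features, and wall distances
    return np.concatenate([free_capacity, agent_vec, item_vec, depot_vec])
```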

The actor and critic networks each consist of 4 hidden layers with 128 neurons and ReLU activations. I chose my hyperparameters as follows (a sketch of the networks and this config follows the list):

learning_rate: 0.0001
gamma: 0.99
lam: 0.95
clip_ratio: 0.2
value_coef: 0.5
entropy_coef: 0.5
num_trajectories: 5
num_epochs: 4
num_minibatches: 4
max_grad_norm: 0.5
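In code, the architecture and configuration look roughly like this (PyTorch assumed; the `mlp` helper and the placeholder observation size are just for illustration, the pastebin has the actual implementation):

```python
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=128, n_layers=4):
    layers, d = [], in_dim
    for _ in range(n_layers):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

OBS_DIM = 132      # placeholder: the real size depends on which features are concatenated
N_ACTIONS = 5      # up, down, left, right, do nothing

actor = mlp(OBS_DIM, N_ACTIONS)   # outputs action logits
critic = mlp(OBS_DIM, 1)          # outputs the state value

config = dict(
    learning_rate=1e-4, gamma=0.99, lam=0.95, clip_ratio=0.2,
    value_coef=0.5, entropy_coef=0.5, num_trajectories=5,
    num_epochs=4, num_minibatches=4, max_grad_norm=0.5,
)
```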

Now, when running this, PPO is not able to learn any usable policy and eventually ends up just doing nothing. Although this is a reasonable policy to learn, given that it at least does not yield a negative total reward, it is clearly not the desired/optimal behavior.

Since this somewhat resembles a local optimum that PPO gets stuck in, I figured it might be a hyperparameter/exploration issue. I therefore increased the entropy coefficient to encourage exploration and likewise increased the number of trajectories per policy rollout so that the agent has more experience available when updating the networks. However, nothing I tried seemed to work. I even ran a WandB sweep, and none of its 100 runs achieved a total reward above 0. After observing this, I thought there had to be a bug of some sort in the code, so I went over the code again and again to figure out what went wrong. However, I could not spot any error in the implementation (which is not to say there is none; I just did not find any mistake after going over the code x times).

Does anyone have a clue what keeps PPO from learning a good strategy? Obviously, the agent has problems connecting the actions of picking up an item and dropping it off. However, I do not understand why this is the case: given the split reward for picking up and dropping off, it should be fairly straightforward for the agent to figure out that with free capacity it should go to an item cell and with full capacity it should go to the depot. A quick back-of-envelope calculation of the returns is below.
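For example, for an item 3 cells away from the agent and 3 cells from the depot (ignoring discounting, and assuming the movement penalty is replaced by the pickup/drop-off reward on those two steps):

```python
# fetching: two -1 steps plus the +7.5 pickup, then two -1 steps plus the +7.5 drop-off
fetch_return = 2 * -1 + 7.5 + 2 * -1 + 7.5   # = 11.0
idle_return = 0.0
print(fetch_return, fetch_return > idle_return)   # 11.0 True -> "do nothing" is only a local optimum
```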

If needed or interested, you can find the entire code via pastebin here: https://pastebin.com/zuRprVWR.

What else can I try to solve this problem? Do you think the problem actually stems from an implementation/logic error, or is there something else going on? Or is PPO simply unable to solve this problem, and would some other algorithm be a better choice?

Upvotes: -3

Views: 50

Answers (0)
