I am using gymnasium and torch. I initially created a custom environment by following the official gymnasium guide: it shows you how to build an NxN Box environment where the agent needs to reach a randomly placed target by moving right, up, left or down.
I modified it so that the agent needs to reach multiple targets in order to finish the episode. In the guide, the state of the environment is a dictionary containing the agent's and the target's coordinates; I changed the space type from Box to Sequence, which allows me to specify an indefinite number of observations:
self.observation_space = gym.spaces.Dict(
    {
        "agent": gym.spaces.Box(0, size-1, shape=(2,), dtype=int),
        "targets": gym.spaces.Sequence(
            gym.spaces.Box(0, size-1, shape=(2,), dtype=int)  # element type
        ),
    }
)
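To make the input format concrete: a sampled observation looks like {"agent": array([ax, ay]), "targets": (array([x1, y1]), array([x2, y2]), ...)}, and before feeding it to the network I flatten it into a fixed-length tensor (see the clarifications further down). Roughly, the state_to_tensor helper used later does something like this sketch (in my code it is a method of the agent and the max_targets parameter is just illustrative):

import numpy as np
import torch

def state_to_tensor(state, max_targets):
    # Sketch only: flatten {"agent": [ax, ay], "targets": ((x1, y1), ...)} into a
    # fixed-length float tensor; already-reached/missing targets are padded with -1.
    flat = list(state["agent"])
    for target in state["targets"]:
        flat.extend(target)
    flat.extend([-1] * (2 + 2 * max_targets - len(flat)))  # keep the input size constant
    return torch.tensor(np.array(flat, dtype=np.float32))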
I took the agent's code from here and adapted it to my environment. Most importantly, I've replaced the replay() method with optimize(), which I took from this github repository (shown in this video). The code is originally for gymnasium's FrozenLake environment, where there are holes you can fall into and a single reward to reach, but I've adapted it to my situation. Another important change is the DQN class (Deep Q-Network): the original one has 2 layers, but mine has 3 layers with a hidden size of 128. Here is its definition:
class DQN(nn.Module):
    def __init__(self, n_input, n_output):
        super(DQN, self).__init__()
        self.layer1 = nn.Linear(n_input, 128)
        self.layer2 = nn.Linear(128, 128)
        self.layer3 = nn.Linear(128, n_output)

    def forward(self, x):
        x = torch.relu(self.layer1(x))
        x = torch.relu(self.layer2(x))
        res = self.layer3(x)
        return res
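For context, both the policy and the target network are built from this class. The sizes are roughly as follows (num_targets is a placeholder for however many targets I spawn; the 4 outputs correspond to the right/up/left/down actions):

# Sketch of how the two networks are set up (variable names are approximations):
n_input = 2 + 2 * num_targets   # agent (x, y) plus (x, y) per target, padded with -1
n_output = 4                    # right, up, left, down

policy_dqn = DQN(n_input, n_output)
target_dqn = DQN(n_input, n_output)
target_dqn.load_state_dict(policy_dqn.state_dict())  # start both networks with the same weights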
However, the neural networks don't seem to learn, and the results are pretty inconsistent: in a typical training run, the steps taken in the first 300 episodes vary from 100 to 600, then the agent keeps hitting MAX_STEPS (which I've set to 1000), and the final ~100 episodes are completed in 100 or so steps. To me, it doesn't look like it's learning at all; it all just seems random.
Shown below is the code of the optimize() method of the agent, which is responsible for updating the neural networks. I added some comments to help you (and me) figure out what the different sections do.
Also, here are some clarifications about the rest of the code in the program:

- each episode lasts at most MAX_STEPS steps; at that point it breaks the cycle and resets the environment (a rough sketch of the training loop is included after the optimize() code below)
- the state is flattened into a tensor like [0, 0, 2, 3, 1, 6] before being passed to the network (the first 2 are the [x, y] coords of the agent, the next 2 are the first target's, ...)
- already-reached targets are set to -1 in that tensor, as I cannot change the number of input neurons
- when the agent stores an experience with terminated set to True, it knows the action that has been made led to victory, and it assigns a higher priority

Code:
def optimize(self, ever_won):
    # checks if it has enough experience and has won at least once
    if (self.batch_size > len(self.memory)) or (not ever_won):
        return

    mini_batch = self.memory.sample(self.batch_size)  # takes a sample from its memory

    current_q_list = []
    target_q_list = []

    for state, reward, action, new_state, terminated in mini_batch:  # cycles through the experiences
        # if the experience led to a victory
        if terminated:
            target = torch.FloatTensor([5])  # I assign a higher priority to the action that made it win
        # otherwise, the q value is calculated
        else:
            with torch.no_grad():
                target = torch.FloatTensor(
                    reward + self.gamma * self.target_dqn(self.state_to_tensor(new_state)).max()
                )

        # get the q values from policy_dqn (main network)
        current_q = self.policy_dqn(self.state_to_tensor(state))
        current_q_list.append(current_q)

        # q values from target_dqn
        target_q = self.target_dqn(self.state_to_tensor(state))
        # adjust the q value of the action
        target_q[action] = target
        target_q_list.append(target_q)

    # Compute loss for the whole minibatch
    loss = self.loss_fn(torch.stack(current_q_list), torch.stack(target_q_list))

    # saving loss for later plotting
    self.losses.append(loss.item())  # self.losses = list()

    # Optimize the model
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()

    # epsilon decay
    self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
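For completeness, optimize() gets called from a training loop along these lines. This is a simplified sketch, not my exact code: choose_action, sync_every and the way memory is filled are approximations.

ever_won = False  # becomes True once any episode has ended with terminated == True
for episode in range(num_episodes):
    state, _ = env.reset()
    for step in range(MAX_STEPS):
        action = agent.choose_action(state)  # epsilon-greedy over policy_dqn's output
        new_state, reward, terminated, truncated, _ = env.step(action)
        # same tuple order that optimize() unpacks
        agent.memory.append((state, reward, action, new_state, terminated))
        state = new_state
        if terminated:  # all targets reached
            ever_won = True
            break
    agent.optimize(ever_won)
    if episode % sync_every == 0:
        # periodically copy the policy weights into the target network
        agent.target_dqn.load_state_dict(agent.policy_dqn.state_dict())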
Edit 12 feb 2025: I've tried plotting the loss using matplotlib.pyplot and the graph below is the result. The loss gets saved in a list inside optimize() (see the line I've added near the end of the code above). As you can see there are many spikes; I guess those come from the stretches of training where the model keeps hitting MAX_STEPS.
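The plot itself is produced with something like this (a minimal sketch, assuming agent.losses is the list that optimize() appends to):

import matplotlib.pyplot as plt

# Plot one loss value per optimize() call
plt.plot(agent.losses)
plt.xlabel("optimize() call")
plt.ylabel("loss")
plt.show()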