tommat208

Reputation: 61

Python: Deep Q Learning agent doesn't seem to learn

I am using gymnasium and torch. I initially created a custom environment by following the official Gymnasium guide: it shows how to build an N×N grid world where the agent needs to reach a randomly placed target by moving right, up, left or down. I modified it so that the agent needs to reach multiple targets in order to finish the episode. In the guide, the state of the environment is a dictionary containing the agent's and the target's coordinates; for the targets I changed the space type from Box to Sequence, which lets me specify an indefinite number of target positions:

self.observation_space = gym.spaces.Dict(
    {
        "agent": gym.spaces.Box(0, size-1, shape=(2,), dtype=int),
        "targets": gym.spaces.Sequence(
            gym.spaces.Box(0, size-1, shape=(2,), dtype=int) # element type
        )
    })
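
The agent consumes this observation through a state_to_tensor() helper (used in optimize() below). I haven't included mine, but the idea is simply to flatten the dictionary into one fixed-length float vector; a simplified sketch, assuming a fixed maximum number of targets with unused slots padded with -1, would be:

import torch

def state_to_tensor(obs, max_targets=3):
    # Sketch only: flatten {"agent": (x, y), "targets": ((x, y), ...)} into a 1-D tensor.
    # Unused target slots are padded with -1 so the input size stays constant.
    parts = list(obs["agent"])
    for target in obs["targets"]:
        parts.extend(target)
    parts.extend([-1] * 2 * (max_targets - len(obs["targets"])))
    return torch.tensor(parts, dtype=torch.float32)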

I took the agent's code from here and adapted it to my environment. Most importantly, I replaced the replay() method with optimize(), which I took from this github repository (shown in this video). That code is originally written for Gymnasium's FrozenLake environment, where there are holes you can fall into and a single goal to reach, but I adapted it to my situation. Another important change is the DQN class (Deep Q-Network): the original one has 2 layers, while mine has 3 layers with 128 neurons each. Here is its definition:

class DQN(nn.Module):
    def __init__(self, n_input, n_output):
        super(DQN, self).__init__()
        self.layer1 = nn.Linear(n_input, 128)
        self.layer2 = nn.Linear(128, 128)
        self.layer3 = nn.Linear(128, n_output)  # one Q value per action

    def forward(self, x):
        x = torch.relu(self.layer1(x))
        x = torch.relu(self.layer2(x))
        return self.layer3(x)
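
For context, the networks and optimizer that optimize() relies on are set up roughly like this (a sketch; max_targets, the input size and the learning rate here are illustrative, not necessarily my exact values):

import torch.nn as nn
import torch.optim as optim

max_targets = 3                 # illustrative; must match what state_to_tensor() pads to
n_input = 2 + 2 * max_targets   # agent (x, y) plus one (x, y) per target slot
n_output = 4                    # right, up, left, down

self.policy_dqn = DQN(n_input, n_output)
self.target_dqn = DQN(n_input, n_output)
self.target_dqn.load_state_dict(self.policy_dqn.state_dict())  # start from identical weights

self.optimizer = optim.Adam(self.policy_dqn.parameters(), lr=1e-3)
self.loss_fn = nn.MSELoss()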

However, the networks don't seem to learn, and the results are pretty inconsistent: in a typical training run, the number of steps per episode varies from 100 to 600 for the first ~300 episodes, then the agent keeps hitting MAX_STEPS (which I've set to 1000), and the final ~100 episodes are completed in around 100 steps. To me it doesn't look like it's learning at all; it all just seems random.

Shown below is the code of the optimize() function inside the agent, which is responsible for updating the neural networks. I added some comments to help you (and me) figure out what the different sections do.

Code:

def optimize(self, ever_won):
    # checks if it has enough experience and has won at least once
    if (self.batch_size > len(self.memory)) or (not ever_won):
        return

    mini_batch = self.memory.sample(self.batch_size) # takes a sample from its memory
    current_q_list = []
    target_q_list = []
    
    for state, reward, action, new_state, terminated in mini_batch: # cycles the experiences
        # if the experience led to a victory
        if terminated: 
            target = torch.FloatTensor([5]) # I assign a higher priority to the action that made it win

        # otherwise, the q value is calculated
        else:
            with torch.no_grad():
                target = torch.FloatTensor(
                    reward + self.gamma * self.target_dqn(self.state_to_tensor(new_state)).max()
                )

        # get the q values from policy_dqn (main network)
        current_q = self.policy_dqn(self.state_to_tensor(state))
        current_q_list.append(current_q)

        # q values from target_dqn
        target_q = self.target_dqn(self.state_to_tensor(state)) 
        # adjust the q value of the action
        target_q[action] = target
        target_q_list.append(target_q)
    
    # Compute loss for the whole minibatch
    loss = self.loss_fn(torch.stack(current_q_list), torch.stack(target_q_list))
    # saving loss for later plotting
    self.losses.append(loss.item()) # self.losses = list()
    # Optimize the model
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()
    # epsilon decay
    self.epsilon = max(self.epsilon_min, self.epsilon*self.epsilon_decay) 
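
For reference, self.memory in the code above is the replay buffer. I haven't shown it, but it's just a thin wrapper around a deque that transitions get appended to and sampled from, along these lines (sketch):

import random
from collections import deque

class ReplayMemory:
    def __init__(self, maxlen):
        self.memory = deque([], maxlen=maxlen)  # oldest transitions are dropped automatically

    def append(self, transition):
        # transition = (state, reward, action, new_state, terminated)
        self.memory.append(transition)

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)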

Edit 12 Feb 2025: I've tried plotting the loss with matplotlib.pyplot, and the graph below is the result. The loss gets saved to a list inside optimize() (see the line I added at the end of the code above). As you can see, there are many spikes; I guess those come from the stretches of training where the model keeps hitting MAX_STEPS.

screenshot of loss graph
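
The plotting itself is nothing special; assuming the trained agent object is called agent, it's essentially:

import matplotlib.pyplot as plt

plt.plot(agent.losses)            # one point per optimize() call
plt.xlabel("optimization step")
plt.ylabel("loss")
plt.show()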

Upvotes: 3

Views: 80

Answers (0)
