I am using gymnasium and torch. I initially created a custom environment by following the official gymnasium guide: it shows you how to build an NxN Box environment where the agent needs to reach a randomly placed target by moving right, up, left or down.
I modified it so that the agent needs to reach multiple targets in order to finish the episode. In the guide, the state of the environment is a dictionary containing the agent's and the target's coordinates; I changed the space type from Box to Sequence, which allows me to specify an indefinite number of observations:
self.observation_space = gym.spaces.Dict(
    {
        "agent": gym.spaces.Box(0, size-1, shape=(2,), dtype=int),
        "targets": gym.spaces.Sequence(
            gym.spaces.Box(0, size-1, shape=(2,), dtype=int)  # element type
        ),
    }
)
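To make the input format concrete: a sampled observation looks like {"agent": array([ax, ay]), "targets": (array([x1, y1]), array([x2, y2]), ...)}, and before feeding it to the network I flatten it into a fixed-length tensor (see the clarifications further down). Roughly, the state_to_tensor helper used later does something like this sketch (in my code it is a method of the agent and the max_targets parameter is just illustrative):

import numpy as np
import torch

def state_to_tensor(state, max_targets):
    # Sketch only: flatten {"agent": [ax, ay], "targets": ((x1, y1), ...)} into a
    # fixed-length float tensor; already-reached/missing targets are padded with -1.
    flat = list(state["agent"])
    for target in state["targets"]:
        flat.extend(target)
    flat.extend([-1] * (2 + 2 * max_targets - len(flat)))  # keep the input size constant
    return torch.tensor(np.array(flat, dtype=np.float32))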
I took the agent's code from here and adapted it to my environment. Most importantly, I've replaced the replay() method with optimize(), which I took from this github repository (shown in this video). The code is originally for gymnasium's FrozenLake environment, where there are holes you can fall into and a single reward to reach, but I've adapted it to my situation. Another important change is the DQN class (Deep Q-Network): the original one has 2 layers, but mine has 3 layers with a hidden size of 128. Here is its definition:
class DQN(nn.Module):
    def __init__(self, n_input, n_output):
        super(DQN, self).__init__()
        self.layer1 = nn.Linear(n_input, 128)
        self.layer2 = nn.Linear(128, 128)
        self.layer3 = nn.Linear(128, n_output)

    def forward(self, x):
        x = torch.relu(self.layer1(x))
        x = torch.relu(self.layer2(x))
        res = self.layer3(x)
        return res
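For context, both the policy and the target network are built from this class. The sizes are roughly as follows (num_targets is a placeholder for however many targets I spawn; the 4 outputs correspond to the right/up/left/down actions):

# Sketch of how the two networks are set up (variable names are approximations):
n_input = 2 + 2 * num_targets   # agent (x, y) plus (x, y) per target, padded with -1
n_output = 4                    # right, up, left, down

policy_dqn = DQN(n_input, n_output)
target_dqn = DQN(n_input, n_output)
target_dqn.load_state_dict(policy_dqn.state_dict())  # start both networks with the same weights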
However, the neural networks don't seem to learn, and the results are pretty inconsistent: in a typical training run, the steps taken in the first 300 episodes vary from 100 to 600, then the agent keeps hitting MAX_STEPS (which I've set to 1000), and the final ~100 episodes are completed in 100 or so steps. To me, it doesn't look like it's learning at all; it all just seems random.
Shown below is the code of the optimize() method of the agent, which is responsible for updating the neural networks. I added some comments to help you (and me) figure out what the different sections do.
Also, here are some clarifications about the rest of the code in the program:

- each episode lasts at most MAX_STEPS steps; at that point it breaks the cycle and resets the environment (a rough sketch of the training loop is included after the optimize() code below)
- the state is flattened into a tensor like [0, 0, 2, 3, 1, 6] before being passed to the network (the first 2 are the [x, y] coords of the agent, the next 2 are the first target's, ...)
- already-reached targets are set to -1 in that tensor, as I cannot change the number of input neurons
- when the agent stores an experience with terminated set to True, it knows the action that has been made led to victory, and it assigns a higher priority

Code:
def optimize(self, ever_won):
    # checks if it has enough experience and has won at least once
    if (self.batch_size > len(self.memory)) or (not ever_won):
        return

    mini_batch = self.memory.sample(self.batch_size)  # takes a sample from its memory

    current_q_list = []
    target_q_list = []

    for state, reward, action, new_state, terminated in mini_batch:  # cycles through the experiences
        # if the experience led to a victory
        if terminated:
            target = torch.FloatTensor([5])  # I assign a higher priority to the action that made it win
        # otherwise, the q value is calculated
        else:
            with torch.no_grad():
                target = torch.FloatTensor(
                    reward + self.gamma * self.target_dqn(self.state_to_tensor(new_state)).max()
                )

        # get the q values from policy_dqn (main network)
        current_q = self.policy_dqn(self.state_to_tensor(state))
        current_q_list.append(current_q)

        # q values from target_dqn
        target_q = self.target_dqn(self.state_to_tensor(state))
        # adjust the q value of the action
        target_q[action] = target
        target_q_list.append(target_q)

    # Compute loss for the whole minibatch
    loss = self.loss_fn(torch.stack(current_q_list), torch.stack(target_q_list))

    # saving loss for later plotting
    self.losses.append(loss.item())  # self.losses = list()

    # Optimize the model
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()

    # epsilon decay
    self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
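For completeness, optimize() gets called from a training loop along these lines. This is a simplified sketch, not my exact code: choose_action, sync_every and the way memory is filled are approximations.

ever_won = False  # becomes True once any episode has ended with terminated == True
for episode in range(num_episodes):
    state, _ = env.reset()
    for step in range(MAX_STEPS):
        action = agent.choose_action(state)  # epsilon-greedy over policy_dqn's output
        new_state, reward, terminated, truncated, _ = env.step(action)
        # same tuple order that optimize() unpacks
        agent.memory.append((state, reward, action, new_state, terminated))
        state = new_state
        if terminated:  # all targets reached
            ever_won = True
            break
    agent.optimize(ever_won)
    if episode % sync_every == 0:
        # periodically copy the policy weights into the target network
        agent.target_dqn.load_state_dict(agent.policy_dqn.state_dict())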
Edit 12 feb 2025: I've tried plotting the loss using matplotlib.pyplot and the graph below is the result. The loss gets saved in a list inside optimize() (see the line I've added near the end of the code above). As you can see there are many spikes; I guess those come from the stretches of training where the model keeps hitting MAX_STEPS.
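The plot itself is produced with something like this (a minimal sketch, assuming agent.losses is the list that optimize() appends to):

import matplotlib.pyplot as plt

# Plot one loss value per optimize() call
plt.plot(agent.losses)
plt.xlabel("optimize() call")
plt.ylabel("loss")
plt.show()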