Reputation: 15
I am currently trying to implement an LSTM network for my robot on a navigation task. I have already applied DDQN with a plain feed-forward NN and want to compare it with an LSTM network. I also managed to implement the LSTM in its simplest form, with a sample and time-step size of 1 and a feature size of 364, i.e. an input shape of (1, 1, 364). From my research, that effectively behaves like a standard FFNN because of the single time step and sample.
That is why I want to increase the time-step size, e.g. to 10 steps. But now another problem occurred. Within the DDQN I use batch learning with batch_size = 64. However, if I choose a time-step size of 10, that no longer works, because the LSTM network now wants 9 additional time steps on top of each sample.
This is where my question arises. If I draw a random batch from a memory that stores information for a single point in time (state, next_state, action, reward, done, all at some time t), do I need to pick a random sample and then add the 9 data points preceding it, so that I end up with a batch of shape (64, 10, 364)?
Kind regards
Latest code:
def buildModel(self):
    model = Sequential()
    model.add(LSTM(64, input_shape=(1, 364), return_sequences=True))
    model.add(LSTM(64))
    model.add(Dense(self.action_size, kernel_initializer='lecun_uniform'))
    model.add(Activation('linear'))
    model.compile(loss='mse', optimizer=RMSprop(lr=self.learning_rate, rho=0.9, epsilon=1e-06))
    model.summary()
    return model
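For comparison, here is a minimal sketch of how buildModel could look once the network consumes sequences of 10 time steps instead of 1. The only structural change is the input_shape; the layer sizes and optimizer settings are simply carried over from the code above, not a recommendation:

def buildModel(self):
    # Sketch only: same architecture as above, but each sample now holds 10 time steps
    # of the 364-dimensional observation instead of a single step.
    model = Sequential()
    model.add(LSTM(64, input_shape=(10, 364), return_sequences=True))
    model.add(LSTM(64))
    model.add(Dense(self.action_size, kernel_initializer='lecun_uniform'))
    model.add(Activation('linear'))
    model.compile(loss='mse', optimizer=RMSprop(lr=self.learning_rate, rho=0.9, epsilon=1e-06))
    return model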
Code for training:
def trainModel(self, target=False):
    mini_batch = random.sample(self.memory, self.batch_size)
    X_batch = np.empty((0, self.state_size), dtype=np.float64)
    Y_batch = np.empty((0, self.action_size), dtype=np.float64)
    Z_batch = np.empty((0, 1), dtype=np.float64)
    for i in range(self.batch_size):
        states = mini_batch[i][0]
        actions = mini_batch[i][1]
        rewards = mini_batch[i][2]
        next_states = mini_batch[i][3]
        dones = mini_batch[i][4]
        q_value = self.model.predict(states.reshape((1, 1, len(states))))
        if target:
            next_target = self.target_model.predict(next_states.reshape((1, 1, len(next_states))))
        else:
            next_target = self.model.predict(next_states.reshape((1, 1, len(next_states))))
        next_q_value = self.getQvalue(rewards, next_target, dones)
        X_batch = np.append(X_batch, np.array([states.copy()]), axis=0)
        Y_sample = q_value.copy()
        Y_sample[0][actions] = next_q_value
        Y_batch = np.append(Y_batch, np.array([Y_sample[0]]), axis=0)
        if dones:
            X_batch = np.append(X_batch, np.array([next_states.copy()]), axis=0)
            Y_batch = np.append(Y_batch, np.array([[rewards] * self.action_size]), axis=0)
    X_batch = X_batch.reshape((len(Y_batch), 1, len(next_states)))
    self.model.fit(X_batch, Y_batch, batch_size=self.batch_size, epochs=1, verbose=0)
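To feed a (64, 10, 364) batch into this training step, one option is to keep the observations of each episode around (or store indices into the episode) and stack the 9 preceding observations onto every sampled transition. A minimal, hypothetical sketch; the episode_buffer layout, the getSequence helper, and sampled_indices are assumptions for illustration, not part of the original code:

def getSequence(self, episode_buffer, t, seq_len=10):
    # Take the observation at time t together with the seq_len - 1 observations before it.
    start = max(0, t - seq_len + 1)
    window = list(episode_buffer[start:t + 1])
    # Pad with the first observation when the episode is shorter than seq_len at time t.
    while len(window) < seq_len:
        window.insert(0, window[0])
    return np.stack(window)   # shape: (seq_len, state_size)

# Inside trainModel, the input batch could then be assembled as:
# X_batch = np.stack([self.getSequence(ep, t) for (ep, t) in sampled_indices])
# X_batch.shape == (self.batch_size, 10, self.state_size)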
Upvotes: 0
Views: 205
Reputation: 809
If you are using an LSTM, the idea is that there is something to be learned from the sequence, as opposed to a single time point, that makes the final prediction "better." So I believe you'll want to construct the input data such that each "sample" consists of 10 time steps' worth of your feature data. The next sample can then be created by adding the next time step and dropping the earliest one (i.e. a rolling window). There are probably other ways to accomplish the same general idea, but that is one suggestion. I hope I understood your use case, but I believe this is a direction to consider, as in the sketch below. I hope this helps.
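As a concrete illustration of the rolling-window idea (a sketch only; seq_len and the deque-based buffer are my assumptions, not tested against your code):

from collections import deque
import numpy as np

seq_len = 10
window = deque(maxlen=seq_len)   # rolling window over the last 10 observations

def observe(state, window, seq_len=10):
    # Append the newest observation; the deque drops the oldest one automatically.
    window.append(state)
    # Pad with the first observation until the window is full (start of an episode).
    padded = list(window)
    while len(padded) < seq_len:
        padded.insert(0, padded[0])
    return np.stack(padded)[np.newaxis, ...]   # shape: (1, seq_len, 364)

# Each call returns one LSTM-ready sample; 64 of these stacked give (64, 10, 364).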
Upvotes: 0