Sayan Pal

Reputation: 4946

How to effectively use batch normalization in LSTM?

I am trying to use batch normalization in an LSTM using Keras in R. In my dataset, the target/output variable is the Sales column, and every row records the Sales for one day across the years 2008-2017. The dataset looks like this:

[figure: Sales data]

My objective is to build an LSTM model on this dataset that can provide predictions at the end of training. I am training the model on the data from 2008-2016, using the first half of the 2017 data as the validation set and the rest as the test set.
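For illustration, the split described above could look like this in Python (the `years` column is a hypothetical stand-in for the dataset's date information; leap days are ignored for simplicity):

```python
import numpy as np

# hypothetical year column aligned with the rows of the dataset
# (one row per day; leap days ignored for simplicity)
years = np.repeat(np.arange(2008, 2018), 365)

train_idx = np.where(years <= 2016)[0]   # 2008-2016: training
idx_2017 = np.where(years == 2017)[0]
half = len(idx_2017) // 2
val_idx = idx_2017[:half]                # first half of 2017: validation
test_idx = idx_2017[half:]               # second half of 2017: test

print(len(train_idx), len(val_idx), len(test_idx))  # 3285 182 183
```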

Previously, I tried creating a model with dropout and early stopping. It looks like this:

mdl1 <- keras_model_sequential()
mdl1 %>%
  layer_lstm(units = 512, input_shape = c(1, 3), return_sequences = T ) %>%  
  layer_dropout(rate = 0.3) %>%
  layer_lstm(units = 512, return_sequences = FALSE) %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 1, activation = "linear")

mdl1 %>% compile(loss = 'mse', optimizer = 'rmsprop')

The model summary looks as follows:

___________________________________________________________
Layer (type)               Output Shape         Param #    
===========================================================
lstm_25 (LSTM)             (None, 1, 512)       1056768    
___________________________________________________________
dropout_25 (Dropout)       (None, 1, 512)       0          
___________________________________________________________
lstm_26 (LSTM)             (None, 512)          2099200    
___________________________________________________________
dropout_26 (Dropout)       (None, 512)          0          
___________________________________________________________
dense_13 (Dense)           (None, 1)            513        
===========================================================
Total params: 3,156,481
Trainable params: 3,156,481
Non-trainable params: 0
___________________________________________________________

To train the model, early stopping is used with a validation set.

mdl1.history <- mdl1 %>% 
  fit(dt.tr, dt.tr.out, epochs=500, shuffle=F,
      validation_data = list(dt.val, dt.val.out),
      callbacks = list(
        callback_early_stopping(min_delta = 0.000001,  patience = 10, verbose = 1)
      ))

On top of this, I want to use batch normalization to speed up the training. As per my understanding, to use batch normalization I need to divide the data into batches and apply layer_batch_normalization to the input of each hidden layer. The model layers look as follows:

batch_size <- 32
mdl2 <- keras_model_sequential()
mdl2 %>%
  layer_batch_normalization(input_shape = c(1, 3), batch_size = batch_size) %>%

  layer_lstm(units = 512, return_sequences = T) %>%
  layer_dropout(rate = 0.3) %>%
  layer_batch_normalization(batch_size = batch_size) %>%

  layer_lstm(units = 512, return_sequences = F) %>%
  layer_dropout(rate = 0.2) %>%
  layer_batch_normalization(batch_size = batch_size) %>%

  layer_dense(units = 1, activation = "linear")

mdl2 %>% compile(loss = 'mse', optimizer = 'rmsprop')

This model looks as follows:

______________________________________________________________________________
Layer (type)                                    Output Shape       Param #    
==============================================================================
batch_normalization_34 (BatchNormalization)     (32, 1, 3)         12         
______________________________________________________________________________
lstm_27 (LSTM)                                  (32, 1, 512)       1056768    
______________________________________________________________________________
dropout_27 (Dropout)                            (32, 1, 512)       0          
______________________________________________________________________________
batch_normalization_35 (BatchNormalization)     (32, 1, 512)       2048       
______________________________________________________________________________
lstm_28 (LSTM)                                  (32, 1, 512)       2099200    
______________________________________________________________________________
dropout_28 (Dropout)                            (32, 1, 512)       0          
______________________________________________________________________________
batch_normalization_36 (BatchNormalization)     (32, 1, 512)       2048       
______________________________________________________________________________
dense_14 (Dense)                                (32, 1, 1)         513        
==============================================================================
Total params: 3,160,589
Trainable params: 3,158,535
Non-trainable params: 2,054
______________________________________________________________________________
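As a side note on what each layer_batch_normalization above computes during training: it standardizes every feature over the batch and then applies a learned scale (gamma) and shift (beta). A minimal numpy sketch, with illustrative values for gamma, beta, and eps:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-3):
    # x: (batch, features); standardize each feature over the batch,
    # then apply the learned scale (gamma) and shift (beta)
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
# each column of y now has roughly zero mean and unit variance
```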

Training the model works the same as before. The only difference lies in the training and validation datasets, which are made of sizes that are multiples of batch_size (32 here), by resampling data from the second-to-last batch into the last batch.
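That resampling step can be sketched as follows (a hypothetical helper; here the trailing rows are simply repeated, but other padding schemes would work too):

```python
import numpy as np

def pad_to_batch_multiple(x, batch_size=32):
    # repeat trailing rows so the row count becomes a multiple of batch_size
    remainder = x.shape[0] % batch_size
    if remainder == 0:
        return x
    pad = batch_size - remainder
    return np.concatenate([x, x[-pad:]], axis=0)

x = np.zeros((70, 3))
print(pad_to_batch_multiple(x).shape)  # (96, 3)
```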

However, the performance of mdl1 is much better than that of mdl2, as can be seen below.

[figure: performance comparison of mdl1 and mdl2]

I am not sure what exactly I am doing wrong, as I am just starting out with Keras (and with practical neural nets in general). Additionally, the performance of the first model is not that good either; any suggestion on how to improve it would also be great.

Upvotes: 5

Views: 9639

Answers (2)

orlem lima dos santos

Reputation: 41

Batch normalization in an LSTM is not that easy to implement. Some papers present amazing results, e.g. Recurrent Batch Normalization (https://arxiv.org/pdf/1603.09025.pdf). The authors apply the following equations:

[equations from the paper: Batch-Normalized LSTM]

Unfortunately, this model is not implemented in Keras yet, only in TensorFlow: https://github.com/OlavHN/bnlstm

However, I was able to get good results using (default) batch normalization after the activation function, without centering and shifting. This approach differs from the paper above, which applies BN after c_t and h_t; maybe it is worth a try.

import tensorflow as tf
from keras.models import Sequential
from keras.layers import LSTM, BatchNormalization, Dense

# neurons1, neurons2, timesteps, data_dim and the momentum m are set elsewhere
model = Sequential()
model.add(LSTM(neurons1,
               activation=tf.nn.relu,
               return_sequences=True,
               input_shape=(timesteps, data_dim)))
# normalize only: no learned shift (center) or scale
model.add(BatchNormalization(momentum=m, scale=False, center=False))
model.add(LSTM(neurons2,
               activation=tf.nn.relu))
model.add(BatchNormalization(momentum=m, scale=False, center=False))
model.add(Dense(1))
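For intuition: with scale=False and center=False the layer only standardizes the activations (no learned gamma/beta), while momentum controls the moving averages used at inference time. A rough numpy sketch of the training-time behaviour (an illustration, not the exact Keras internals):

```python
import numpy as np

def bn_no_affine(x, running_mean, running_var, momentum=0.99, eps=1e-3):
    # standardize with batch statistics; no learned scale/shift
    mean, var = x.mean(axis=0), x.var(axis=0)
    # moving averages, later used at inference time
    running_mean = momentum * running_mean + (1 - momentum) * mean
    running_var = momentum * running_var + (1 - momentum) * var
    return (x - mean) / np.sqrt(var + eps), running_mean, running_var

acts = np.array([[0.5, 2.0], [1.5, 0.0], [2.5, 4.0]])
out, rm, rv = bn_no_affine(acts, np.zeros(2), np.ones(2))
# out has roughly zero mean and unit variance per feature
```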

Upvotes: 4

Manngo

Reputation: 829

I'm using Keras with Python, but I can try R. The fit documentation says that batch_size defaults to 32 if omitted. This is no longer true in the current version, as can be seen in the source code. I think you should try it like this; at least this way it works in Python:

mdl2 <- keras_model_sequential()
mdl2 %>%
  # in the sequential API the first layer carries input_shape
  # (layer_input() belongs to the functional API)
  layer_batch_normalization(input_shape = c(1, 3)) %>%
  layer_lstm(units = 512, return_sequences = TRUE, dropout = 0.3) %>%

  layer_batch_normalization() %>%
  layer_lstm(units = 512, return_sequences = FALSE, dropout = 0.2) %>%

  layer_batch_normalization() %>%
  layer_dense(units = 1, activation = "linear")

mdl2 %>% compile(loss = 'mse', optimizer = 'rmsprop')
mdl2.history <- mdl2 %>% 
  fit(dt.tr, dt.tr.out, epochs=500, shuffle=F,
      validation_data = list(dt.val, dt.val.out),
      batch_size=32,
      callbacks = list(
        callback_early_stopping(min_delta = 0.000001,  patience = 10, verbose = 1)
      ))

Upvotes: 1
