Reputation: 425
This is about Keras in R, but from my understanding it applies to Python as well.
Keras models are modified in place. Not initially understanding what that means has caught me out in the past, so I thought I would write down its implications for training multiple models in the same session, so that others can avoid the mistakes I made.
What it means is that you can't copy a model object by simple assignment, for example:
model = keras_model_sequential()
model %>%
  layer_dense(
    units = 50,
    input_shape = 100
  )
model_copy = model
Now, if you try to modify either model or model_copy, the other will be modified as well. The answer below explains why.
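For illustration, a quick sketch continuing from the snippet above (the commented values are what I would expect to see):
# adding a layer through one name also changes the model under the other name
model_copy %>% layer_dense(units = 10)
length(model$layers)       # 2 -- the "original" model gained the new layer too
length(model_copy$layers)  # 2 -- both names point to the same underlying model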
Upvotes: 1
Views: 515
Reputation: 425
R can either modify in place, meaning an object is changed at its existing location in memory, or create a new copy of the object when it is modified (this is explained here). If you create an object and then a second name that points to that object, both names point to the same place in memory, so naively modifying one would also modify the other.
To avoid this, R tracks whether one or more than one name points to the same place in memory. If only one name points there, it is safe to modify the values in memory directly; this is modification in place. If more than one name points there, modifying the object through one name would change it for the other, so R instead copies the object being modified to a new part of memory. The two objects then no longer share memory, and modifying one does not affect the other.
This is not what happens with Keras in R, and as I understand it (though I haven't used Keras in Python yet) it also isn't what happens in Python. Keras always modifies the model in place, regardless of how many names point to that spot in memory. So anything done to one model is also done to the "other", because the two names really refer to the same model: both "objects" are actually just one object.
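For contrast, here is a minimal sketch of ordinary copy-on-modify in base R, using tracemem() to report when the shared memory gets copied; this is the behaviour Keras models do not follow:
x = c(1, 2, 3)
y = x            # x and y currently share the same memory
tracemem(x)      # report when that memory is copied
y[1] = 100       # modifying y triggers a copy, so x is left untouched
x                # still 1 2 3
untracemem(x)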
To show where this can trip you up, here is an example of training an MNIST classification network to compare two RMSprop learning rates. If one didn't know about modification in place in Keras, one might write the following code:
library(keras)
# data
mnist = dataset_mnist()
x_train = mnist$train$x
y_train = mnist$train$y
x_train = array_reshape(x_train, c(nrow(x_train), 784))
x_train = x_train / 255
y_train = to_categorical(y_train, 10)
# model
model = keras_model_sequential()
model %>%
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = 'softmax')
# compile and train a model, given a learning rate
comp_train = function(learning_rate) {
  model_copy = model  # this does NOT create an independent model (see below)
  model_copy %>% compile(
    loss = 'categorical_crossentropy',
    optimizer = optimizer_rmsprop(
      lr = learning_rate
    ),
    metrics = c('accuracy')
  )
  training_history = model_copy %>% fit(
    x_train, y_train,
    epochs = 30, batch_size = 128,
    validation_split = 0.2
  )
  return(
    as.data.frame(training_history)
  )
}
# test two learning rates
lr_0.001 = comp_train(0.001)
lr_0.0001 = comp_train(0.0001)
The training-history plot for the learning rate of 0.001 makes sense. However, the plot for the learning rate of 0.0001 is very unexpected.
These results stop being surprising once one realises that the second plot is just a continuation of the first for a further 30 epochs; put together, the two plots show the same neural net being trained for 60 epochs. This is because of modification in place: when you train the "second" network, you are actually just training the first one again, even though it has already been trained.
So what should be done differently? With Keras, each distinct model must be initialised with its own call to keras_model_sequential() or keras_model() (whichever type you use). So we define each model separately:
library(keras)
# data
mnist = dataset_mnist()
x_train = mnist$train$x
y_train = mnist$train$y
x_train = array_reshape(x_train, c(nrow(x_train), 784))
x_train = x_train / 255
y_train = to_categorical(y_train, 10)
# models: one per learning rate, each initialised separately
model_lr0.001 = keras_model_sequential()
model_lr0.0001 = keras_model_sequential()
model_lr0.001 %>%
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = 'softmax')
model_lr0.0001 %>%
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = 'softmax')
# compile and train a given model, also given a learning rate
comp_train = function(model, learning_rate) {
  model %>% compile(
    loss = 'categorical_crossentropy',
    optimizer = optimizer_rmsprop(
      lr = learning_rate
    ),
    metrics = c('accuracy')
  )
  training_history = model %>% fit(
    x_train, y_train,
    epochs = 30, batch_size = 128,
    validation_split = 0.2
  )
  return(
    as.data.frame(training_history)
  )
}
# test two learning rates
lr_0.001 = comp_train(model_lr0.001, 0.001)
lr_0.0001 = comp_train(model_lr0.0001, 0.0001)
This time, we get the results we would expect, and we can now successfully compare the two learning rates. "Better" code would define the model (with keras_model_sequential()) inside the function itself, which also gives the expected results; a minimal sketch of that approach is below.
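For reference, this is roughly what that could look like (a sketch only, reusing the data preparation from the code above; it simply moves the model definition inside comp_train()):
# compile and train a freshly initialised model, given a learning rate
comp_train = function(learning_rate) {
  # a new model is created on every call, so the runs are independent
  model = keras_model_sequential()
  model %>%
    layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
    layer_dropout(rate = 0.4) %>%
    layer_dense(units = 128, activation = 'relu') %>%
    layer_dropout(rate = 0.3) %>%
    layer_dense(units = 10, activation = 'softmax')
  model %>% compile(
    loss = 'categorical_crossentropy',
    optimizer = optimizer_rmsprop(lr = learning_rate),
    metrics = c('accuracy')
  )
  training_history = model %>% fit(
    x_train, y_train,
    epochs = 30, batch_size = 128,
    validation_split = 0.2
  )
  as.data.frame(training_history)
}
lr_0.001 = comp_train(0.001)
lr_0.0001 = comp_train(0.0001)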
Upvotes: 1