If Keras results are not reproducible, what's the best practice for comparing models and choosing hyper parameters?

UPDATE: This question was for Tensorflow 1.x. I upgraded to 2.0 and (at least on the simple code below) the reproducibility issue seems fixed on 2.0. So that solves my problem; but I'm still curious about what "best practices" were used for this issue on 1.x.

Training the exact same model, parameters, and data on Keras/TensorFlow does not give reproducible results, and the loss is significantly different each time you train the model. There are many Stack Overflow questions about this (e.g., "How to get reproducible results in keras"), but the recommended workarounds don't seem to work for me or for many other people on Stack Overflow. OK, it is what it is.

But given that limitation of non-reproducibility with keras on tensorflow -- what's the best practice for comparing models and choosing hyper parameters? I'm testing different architectures and activations, but since the loss estimate is different each time, I'm never sure if one model is better than the other. Is there any best practice for dealing with this?

I don't think the issue has anything to do with my code, but just in case it helps; here's a sample program:

import os
#stackoverflow says turning off the GPU helps reproducibility, but it doesn't help for me
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = ""
os.environ['PYTHONHASHSEED']=str(1)

import tensorflow as tf
import tensorflow.keras as keras
import tensorflow.keras.layers 
import random
import pandas as pd
import numpy as np

#StackOverflow says this is needed for reproducibility but it doesn't help for me
from tensorflow.keras import backend as K
config = tf.ConfigProto(intra_op_parallelism_threads=1,inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=config)
K.set_session(sess)

#make some random data
NUM_ROWS = 1000
NUM_FEATURES = 10
random_data = np.random.normal(size=(NUM_ROWS, NUM_FEATURES))
df = pd.DataFrame(data=random_data, columns=['x_' + str(ii) for ii in range(NUM_FEATURES)])
y = df.sum(axis=1) + np.random.normal(size=(NUM_ROWS))

def run(x, y):
    #StackOverflow says you have to set the seeds but it doesn't help for me
    tf.set_random_seed(1)
    np.random.seed(1)
    random.seed(1)
    os.environ['PYTHONHASHSEED']=str(1)

    model = keras.Sequential([
            keras.layers.Dense(40, input_dim=df.shape[1], activation='relu'),
            keras.layers.Dense(20, activation='relu'),
            keras.layers.Dense(10, activation='relu'),
            keras.layers.Dense(1, activation='linear')
        ])
    NUM_EPOCHS = 500
    model.compile(optimizer='adam', loss='mean_squared_error')
    model.fit(x, y, epochs=NUM_EPOCHS, verbose=0)
    predictions = model.predict(x).flatten()
    loss = model.evaluate(x,  y) #This prints out the loss by side-effect

#Each time we run it gives a wildly different loss. :-(
run(df, y)
run(df, y)
run(df, y)

Given the non-reproducibility, how can I evaluate whether changes in my hyper-parameters and architecture are helping or not?

Upvotes: 5

Answers (4)

Oscar Monge

Reputation: 91

Using only the code below works. The key point, and it is very important, is to call reset_seeds() every time before running the model. Doing so gives reproducible results, as I verified in Google Colab.

import numpy as np
import tensorflow as tf
import random as python_random

def reset_seeds():
   np.random.seed(123) 
   python_random.seed(123)
   tf.random.set_seed(1234)

reset_seeds()
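
To illustrate why the reset has to happen before every run rather than once at the top of the script, here is a minimal sketch using only Python's random module. fake_init is a hypothetical stand-in for weight initialization, not a Keras function; the same reasoning is assumed to carry over to the NumPy and TensorFlow generators.

```python
import random

def fake_init(n=3):
    """Hypothetical stand-in for weight initialization: draws n random numbers."""
    return [random.random() for _ in range(n)]

random.seed(123)
run_1 = fake_init()
run_2 = fake_init()  # no reseed: the generator state has advanced since run_1

random.seed(123)
run_3 = fake_init()  # reseeded: starts from the same generator state as run_1

print("run_2 matches run_1:", run_1 == run_2)
print("run_3 matches run_1:", run_1 == run_3)
```

Only the reseeded run reproduces the first one, because seeding once fixes the whole sequence of draws, not each individual run.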

Upvotes: 1

user2543623

Reputation: 1562

The problem appears to be solved in Tensorflow 2.0 (at least on simple models)! Here is a code snippet that seems to yield repeatable results.

import os
####*IMPORTANT*: Have to do this line *before* importing tensorflow
os.environ['PYTHONHASHSEED']=str(1)

import tensorflow as tf
import tensorflow.keras as keras
import tensorflow.keras.layers 
import random
import pandas as pd
import numpy as np

def reset_random_seeds():
   os.environ['PYTHONHASHSEED']=str(1)
   tf.random.set_seed(1)
   np.random.seed(1)
   random.seed(1)

#make some random data
reset_random_seeds()
NUM_ROWS = 1000
NUM_FEATURES = 10
random_data = np.random.normal(size=(NUM_ROWS, NUM_FEATURES))
df = pd.DataFrame(data=random_data, columns=['x_' + str(ii) for ii in range(NUM_FEATURES)])
y = df.sum(axis=1) + np.random.normal(size=(NUM_ROWS))

def run(x, y):
    reset_random_seeds()

    model = keras.Sequential([
            keras.layers.Dense(40, input_dim=df.shape[1], activation='relu'),
            keras.layers.Dense(20, activation='relu'),
            keras.layers.Dense(10, activation='relu'),
            keras.layers.Dense(1, activation='linear')
        ])
    NUM_EPOCHS = 500
    model.compile(optimizer='adam', loss='mean_squared_error')
    model.fit(x, y, epochs=NUM_EPOCHS, verbose=0)
    predictions = model.predict(x).flatten()
    loss = model.evaluate(x,  y) #This prints out the loss by side-effect

#With Tensorflow 2.0 this is now reproducible! 
run(df, y)
run(df, y)
run(df, y)

Upvotes: 4

learningthemachine

Reputation: 614

You have a few options for stabilizing performance...

1) Set the seed for your initializers so they are always initialized to the same values.

2) More data generally results in a more stable convergence.

3) Lower learning rates and bigger batch sizes are also good for more predictable learning.

4) Train for a fixed number of epochs instead of using callbacks to modify hyperparameters during training.

5) K-fold validation to train on different subsets. The average of these folds should result in a fairly predictable metric.

6) Also you have the option of just training multiple times and taking an average of this.
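
As a rough sketch of option 6, the snippet below runs each configuration once per seed and reports the mean loss with its standard error, so that two configurations can be compared against the run-to-run noise. The names summarize_runs and fake_train are hypothetical; fake_train only returns a noisy number so the sketch runs without TensorFlow, and in practice it would be replaced by a function that builds, fits, and evaluates the Keras model for a given seed.

```python
import random
import statistics

def summarize_runs(train_and_evaluate, seeds):
    """Run one configuration once per seed; return (mean, stdev, standard error) of the loss."""
    losses = [train_and_evaluate(seed) for seed in seeds]
    mean = statistics.mean(losses)
    stdev = statistics.stdev(losses)  # sample standard deviation, needs at least 2 runs
    stderr = stdev / len(losses) ** 0.5
    return mean, stdev, stderr

def fake_train(base_loss):
    """Hypothetical stand-in for 'build, fit and evaluate a model with this seed'."""
    def train_and_evaluate(seed):
        rng = random.Random(seed)
        return base_loss + rng.gauss(0.0, 0.05)  # base loss plus seed-dependent noise
    return train_and_evaluate

seeds = range(10)
mean_a, stdev_a, stderr_a = summarize_runs(fake_train(1.00), seeds)
mean_b, stdev_b, stderr_b = summarize_runs(fake_train(1.30), seeds)

print(f"config A: {mean_a:.3f} +/- {stderr_a:.3f}")
print(f"config B: {mean_b:.3f} +/- {stderr_b:.3f}")
```

If the gap between the two means is large compared with the standard errors, the difference is probably not just noise; if it is comparable, more runs are needed before preferring one configuration.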

Upvotes: 0

OverLordGoldDragon

Reputation: 19796

It's sneaky, but your code does, in fact, lack a step for better reproducibility: resetting the Keras & TensorFlow graphs before each run. Without this, tf.set_random_seed() won't work properly - see correct approach below.

I'd exhaust all the options before throwing in the towel on reproducibility; currently I'm aware of only one such instance, and it's likely a bug. Nonetheless, it's possible you'll get notably differing results even if you follow all the steps. In that case, see "If nothing works", but neither option there is very productive, so it's best to focus on attaining reproducibility:

Definitive improvements:

Use reset_seeds(K) below
Increase numeric precision: K.set_floatx('float64')
Set PYTHONHASHSEED before the Python kernel starts - e.g. from terminal
Upgrade to TF 2, which includes some reproducibility bug fixes, but mind performance
Run CPU on a single thread (painfully slow)
Do not import from tf.python.keras - see here
Ensure all imports are consistent (i.e. don't do from keras.layers import ... and from tensorflow.keras.optimizers import ...)
Use a superior CPU - for example, Google Colab, even if using GPU, is much more robust against numeric imprecision - see this SO
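
On the PYTHONHASHSEED point: Python reads that variable when the interpreter starts, so assigning os.environ['PYTHONHASHSEED'] inside an already running script does not change string hashing in that process. A minimal sketch of setting it for a fresh interpreter, using only the standard library (hash_in_fresh_interpreter is a hypothetical helper name):

```python
import os
import subprocess
import sys

def hash_in_fresh_interpreter(hash_seed):
    """Start a new Python process with PYTHONHASHSEED set and return hash('keras') from it."""
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('keras'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())

first = hash_in_fresh_interpreter(1)
second = hash_in_fresh_interpreter(1)
print("same hash in both processes:", first == second)
```

Launching the training script from a terminal with the variable already exported has the same effect as the env argument here.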

Also see related SO on reproducibility

If nothing works:

Rerun X times w/ exact same hyperparameters & seeds, average results
K-Fold Cross-Validation w/ exact same hyperparameters & seeds, average results - superior option, but more work involved
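
A minimal sketch of the K-fold option, assuming the caller supplies a fit_and_score function that trains on the training indices and returns a validation score. k_fold_indices, cross_validate, and fake_fit_and_score are hypothetical names; the fake scorer only exists so the sketch runs without TensorFlow.

```python
import random

def k_fold_indices(num_rows, num_folds, seed):
    """Yield (train_idx, val_idx) pairs; the shuffle is seeded so folds are identical across runs."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::num_folds] for i in range(num_folds)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, val_idx

def cross_validate(fit_and_score, num_rows, num_folds=5, seed=0):
    """Average the validation score returned by fit_and_score over all folds."""
    scores = [fit_and_score(train_idx, val_idx)
              for train_idx, val_idx in k_fold_indices(num_rows, num_folds, seed)]
    return sum(scores) / len(scores)

def fake_fit_and_score(train_idx, val_idx):
    """Hypothetical stand-in for training a model; returns the validation fraction as a dummy score."""
    return len(val_idx) / (len(train_idx) + len(val_idx))

avg_score = cross_validate(fake_fit_and_score, num_rows=1000, num_folds=5)
print(f"average score over folds: {avg_score:.3f}")
```

Keeping the fold seed fixed means every hyperparameter setting is scored on exactly the same splits, so differences in the averaged metric come from the model rather than from the split.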

Correct reset method:

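import random
import numpy as np
import tensorflow as tf

def reset_seeds(reset_graph_with_backend=None):
    # Pass the Keras backend module (K) to also clear the Keras session and the TF graph
    if reset_graph_with_backend is not None:
        K = reset_graph_with_backend
        K.clear_session()
        tf.compat.v1.reset_default_graph()
        print("KERAS AND TENSORFLOW GRAPHS RESET")  # optional

    # Reseed every random number generator the model may draw from
    np.random.seed(1)
    random.seed(2)
    tf.compat.v1.set_random_seed(3)
    print("RANDOM SEEDS RESET")  # optional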

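Running TF on a single CPU thread: (code for TF1 only)

import tensorflow as tf
from tensorflow.keras import backend as K

session_conf = tf.ConfigProto(
      intra_op_parallelism_threads=1,
      inter_op_parallelism_threads=1)
sess = tf.Session(config=session_conf)
K.set_session(sess)  # register the session so Keras actually uses it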

Upvotes: 10
