Shlomi Schwartz

Reputation: 8913

Keras - Hyper Tuning the initial state of the model

I've written an LSTM model that predicts sequential data.

def get_model(config, num_features, output_size):
    opt = Adam(learning_rate=get_deep(config, 'hp.learning_rate'), beta_1=get_deep(config, 'hp.beta_1'))

    inputs = Input(shape=[None, num_features], dtype=tf.float32, ragged=True)
    layers = LSTM(get_deep(config, 'hp.lstm_neurons'), activation=get_deep(config, 'hp.lstm_activation'))(
        inputs.to_tensor(), mask=tf.sequence_mask(inputs.row_lengths()))

    layers = BatchNormalization()(layers)
    if 'dropout_rate' in config['hp']:
        layers = Dropout(get_deep(config, 'hp.dropout_rate'))(layers)

    for layer in get_deep(config, 'hp.dense_layers'):
        layers = Dense(layer['neurons'], activation=layer['activation'])(layers)
        layers = BatchNormalization()(layers)
        if 'dropout_rate' in layer:
            layers = Dropout(layer['dropout_rate'])(layers)

    layers = Dense(output_size, activation='sigmoid')(layers)
    model = Model(inputs, layers)
    model.compile(loss='mse', optimizer=opt, metrics=['mse'])
    model.summary()
    return model

I've tuned some of the layers' parameters using AWS SageMaker. While validating the model, I ran it with one specific configuration several times. Most of the time the results were similar; however, one run was much better than the others, which led me to think that the initial state of the model is probably crucial in order to get the best performance.

As suggested in this video, weight initialization can provide some performance boost. I've googled around and found layer weight initializers, but I'm not sure what ranges I should tune.


Update: As suggested in some of the comments / answers I'm using a fixed seed to "lock" the model results:

# Set `python` built-in pseudo-random generator at a fixed value
random.seed(seed_value)
# Set `numpy` pseudo-random generator at a fixed value
np.random.seed(seed_value)
# Set `tensorflow` pseudo-random generator at a fixed value
tf.random.set_seed(seed_value)

The results now replicate on each new training run; however, some seeds produce much better results than others. So how do I find/tune the best seed?

Upvotes: 1

Views: 1156

Answers (5)

tdMJN6B2JtUe

Reputation: 428

Short answer: you can neither efficiently nor effectively tune the seed for a pseudo-random number generator. It is not only infeasible due to the extremely large search space, but also impractical for many other reasons, including the fact that pseudo-random number generator implementations change from time to time so you would need to start over every time that happened.

If, for some reason, you are hell-bent on discovering this for yourself, I recommend using NumPy's default_rng object to be the single source of all pseudo-randomness in your algorithm. Then, based on a single seed, you can produce other seeds deterministically for use with, say, tf.random.set_seed.
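A minimal sketch of that idea, assuming NumPy is available. The names MASTER_SEED and derive_seeds are hypothetical and chosen only for illustration; the derived values would then be handed to random.seed and tf.random.set_seed.

```python
import numpy as np

# Hypothetical master seed; every other seed is derived from it.
MASTER_SEED = 12345

def derive_seeds(master_seed, n):
    """Deterministically derive n non-negative integer seeds from one master seed."""
    rng = np.random.default_rng(master_seed)
    # The upper bound is exclusive, so every value fits in a signed 32-bit integer.
    return [int(s) for s in rng.integers(0, 2**31 - 1, size=n)]

python_seed, tf_seed = derive_seeds(MASTER_SEED, 2)
# These would be passed on as random.seed(python_seed) and
# tf.random.set_seed(tf_seed); the calls are left out so that
# the sketch depends on NumPy only.
print(python_seed, tf_seed)
```

Changing MASTER_SEED changes every derived seed at once, so one number identifies a whole run.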

Upvotes: 1

Innat

Reputation: 17239

... which led me to think that the initial state of the model is probably crucial in order to get the best performance. ... As suggested in this video, weight initialization can provide some performance boost. I've googled around and found layer weight initializers, but I'm not sure what ranges I should tune.

First, as that video notes, apart from the weight (state) initializer, factors such as the learning rate, schedule, optimizer, batch size, loss function and model depth are the ones you should experiment with to find the best combination (the role of the seed is discussed later). Normally there is no need to tune the default weight or state initializers, as they are currently regarded as the best general-purpose choices; initialization itself remains a research problem.

Second, in Keras, the default kernel (weight) initializer for convolutional, Dense and recurrent (GRU/LSTM) layers is glorot_uniform, also known as the Xavier uniform initializer, and the default bias initializer is zeros. (For LSTM and GRU, the recurrent kernel defaults to orthogonal.) You can confirm this in the source code of LSTM, the layer used in your case. According to the documentation, glorot_uniform:

Draws samples from a uniform distribution within [-limit, limit], where limit = sqrt(6 / (fan_in + fan_out)) (fan_in is the number of input units in the weight tensor and fan_out is the number of output units).
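As a quick illustration of the quoted formula, here is a small sketch that computes the limit by hand. The function name glorot_uniform_limit and the 64-by-32 layer size are assumptions made for the example, not part of Keras.

```python
import math

def glorot_uniform_limit(fan_in, fan_out):
    """Bound of the uniform distribution used by Glorot/Xavier uniform."""
    return math.sqrt(6.0 / (fan_in + fan_out))

# A (2, 2) weight matrix has fan_in = 2 and fan_out = 2.
print(round(glorot_uniform_limit(2, 2), 4))  # 1.2247

# Hypothetical Dense kernel mapping 64 inputs to 32 units.
print(glorot_uniform_limit(64, 32))  # 0.25
```

The larger the layer, the narrower the range from which its initial weights are drawn.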

You may have noticed that this initializer inherits from VarianceScaling; like GlorotUniform, others such as GlorotNormal, LecunNormal, LecunUniform, HeNormal and HeUniform inherit from it as well. The parameters that VarianceScaling supports are listed here. For example, the following two initializers are technically the same.

import tensorflow as tf

# To try different initializers, use VarianceScaling with the appropriate
# parameters, e.g. tf.keras.layers.LSTM(..., kernel_initializer=initializer).
# Sticking with the default (glorot_uniform) is still recommended.
initializer = tf.keras.initializers.VarianceScaling(scale=1.,
                                                    mode='fan_avg', seed=101,
                                                    distribution='uniform')
print(initializer(shape=(2, 2)))

# GlorotUniform is VarianceScaling with the parameters above,
# so the same seed yields the same values.
initializer = tf.keras.initializers.GlorotUniform(seed=101)
print(initializer(shape=(2, 2)))

tf.Tensor(
[[-1.0027379  1.0746485]
 [-1.2234    -1.1489409]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[-1.0027379  1.0746485]
 [-1.2234    -1.1489409]], shape=(2, 2), dtype=float32)

In short, you can play with tf.keras.initializers.VarianceScaling (at the bottom of the page). Additionally, you can make your own initializer by defining a callable function or by subclassing the Initializer class. For example:

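import numpy as np
import tensorflow as tf

def conv_kernel_initializer(shape, dtype=None):
  # shape is (kernel_height, kernel_width, in_filters, out_filters)
  kernel_height, kernel_width, _, out_filters = shape
  fan_out = int(kernel_height * kernel_width * out_filters)
  # Normal distribution whose variance is scaled by fan_out
  return tf.random.normal(
      shape, mean=0.0, stddev=np.sqrt(2.0 / fan_out), dtype=dtype)

def dense_kernel_initializer(shape, dtype=None):
  # shape is (in_units, out_units); the range shrinks with the output size
  init_range = 1.0 / np.sqrt(shape[1])
  return tf.random.uniform(shape, -init_range, init_range, dtype=dtype)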

Here is a good article about weight initialization that you may enjoy reading. But again, it is better to go with the defaults.

Third, on setting different seed values, different sets of hyperparameters and so on, I will point to one of my old answers here; its first diagram in particular will probably come in handy for your experiments. One approach I follow is to keep the seed the same (say, for the first 5 experiments), change one other factor at a time, and log the results. After those 5 iterations we will hopefully have a good set of candidates to take further.


Update

Find/Tune Seed. Before looking for a way to find the best seed, one must understand that the seed is not a hyperparameter to be tuned alongside others such as the learning rate, scheduler or optimizer.
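One way to act on this, sketched under assumptions, is to treat the seed as a source of noise and report the mean and spread of the score over several seeds instead of picking a winner. The function train_and_evaluate below is a hypothetical stand-in that only simulates a noisy score; in practice it would run a full training and return the validation metric.

```python
import random
import statistics

def train_and_evaluate(seed):
    """Hypothetical stand-in for a full training run.

    It only simulates a noisy validation score so the sketch runs on its own.
    """
    rng = random.Random(seed)
    return 0.80 + rng.gauss(0.0, 0.02)

seeds = [0, 1, 2, 3, 4]
scores = [train_and_evaluate(s) for s in seeds]

mean = statistics.mean(scores)
spread = statistics.stdev(scores)
print(f"score = {mean:.3f} +/- {spread:.3f} over {len(seeds)} seeds")
```

Two configurations can then be compared by their means relative to the spread, rather than by a single lucky run.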

Here is one scenario. Say you split the data randomly with seed 42 into a train set (70%) and a test set (30%); after training on the train set, you evaluate your model on the test set and get a score of 80. Then you change the seed to 101, repeat the process, and get a score of 50. This doesn't mean that seed 42 is better; it simply means your model is unstable and most likely won't do well on unseen data. This is a well-known issue when a data set is split randomly for training and testing. Why does it happen? Because a random split can produce a mismatch in class distribution between the two parts. Please check the following two closely related discussions on this:

Upvotes: 3

Antoine Dubuis

Reputation: 5324

Indeed, the initial state of the model is crucial in order to get the best performance. Deep learning works by optimizing a non-convex loss function in order to find a good local minimum.

The initial weights define the starting location of the optimization, as shown in the picture below: training moves the model from that starting point to a nearby local minimum. As you can see, there is a starting weight configuration that allows the optimization to reach the global minimum.

enter image description here
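The same effect can be reproduced in one dimension. The sketch below is an illustration only: the function f is an arbitrary non-convex example chosen here, not the loss of any real model, and plain gradient descent is used instead of Adam.

```python
def f(x):
    """Toy non-convex 'loss' with two minima."""
    return x**4 - 3 * x**2 + x

def grad_f(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def descend(x0, lr=0.01, steps=500):
    """Plain gradient descent starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

left = descend(-2.0)   # ends in the deeper (global) minimum
right = descend(2.0)   # ends in the shallower local minimum
print(f"start -2.0 -> x = {left:.3f}, f(x) = {f(left):.3f}")
print(f"start +2.0 -> x = {right:.3f}, f(x) = {f(right):.3f}")
```

Both runs use the same function and the same learning rate; only the starting point differs, and so does the minimum that is reached.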

It is sometimes possible to get a better weight initialization through transfer learning, that is, reusing the weights of an already trained model for a downstream task (for example VGG-16 for image classification, or BERT for NLP).

In your case, you should not try to fine-tune the weight initialization, as it is meant to be random. Changing the architecture of your neural network or its hyperparameters is far more likely to improve performance.

Upvotes: 2

user5178150

Reputation:

Maybe you are looking for an exponentially decaying learning rate. Let me explain: say your first epoch sometimes has a loss of 3000 or 4000, and sometimes just 500. If you run a model often, you will probably recognize a "real barrier", a loss level beyond which you no longer say "that's because of the initial state". You want to get there quickly, but without keeping the bad side effects of a high learning rate (e.g. 1E-3); later in training you would rather have something like 1E-5. That is where exponential decay comes in.

Create a schedule with myLr = tf.train.exponential_decay(...) (the TensorFlow 1.x API; in TensorFlow 2 the equivalent is tf.keras.optimizers.schedules.ExponentialDecay(...)) and pass it to your optimizer in place of the numeric learning rate,

for example Adam(myLr).
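To make the schedule concrete, here is a plain-Python sketch of the decay formula, initial_lr * decay_rate ** (step / decay_steps). The helper name and the numbers (1e-3 decaying tenfold every 1000 steps) are assumptions chosen to match the 1E-3 to 1E-5 range mentioned above; the commented lines show how the same settings might be passed to Keras in TensorFlow 2.

```python
def exponential_decay(initial_lr, decay_rate, decay_steps, step, staircase=False):
    """Learning rate after `step` optimizer steps under exponential decay."""
    # With staircase=True the rate drops in discrete jumps every decay_steps.
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_lr * decay_rate ** exponent

# Hypothetical settings: start at 1e-3 and shrink tenfold every 1000 steps.
for step in (0, 500, 1000, 2000):
    print(step, exponential_decay(1e-3, 0.1, 1000, step))

# Assumed TensorFlow 2 equivalent (not executed here):
# schedule = tf.keras.optimizers.schedules.ExponentialDecay(
#     initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.1)
# opt = tf.keras.optimizers.Adam(learning_rate=schedule)
```

With these values the rate is about 1e-4 after 1000 steps and about 1e-5 after 2000 steps.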

Upvotes: 2

Harris Minhas

Reputation: 790

I don't think there is a "one size fits all" solution to this issue. The best initial weights depend heavily on the kind of problem at hand and on the data used to solve it. All we can do is point you towards a good resource where you can see which of the approaches mentioned fits your problem. The following article is a good resource that not only gives a detailed understanding of how and why to initialize weights, but also points to peer-reviewed research that can help build an academic understanding.

Upvotes: 3
