Mega

Reputation: 545

Random behavior when restoring saved TensorFlow model

I have a saved TensorFlow model which I would like to evaluate deterministically for final predictions. When restoring the model and running predictions, there is a point in the network where tensor values are (unexpectedly) computed non-deterministically.

This is the problematic point:

self.h0 = tf.concat([self.q_weighted, self.x_weighted], 1, name='h0')
self.h1 = tf.layers.dense(inputs=self.h0, units=512, activation=tf.nn.relu, name='h1',
                          kernel_initializer=self.kernel_initializer,
                          bias_initializer=self.bias_initializer)

Where:

self.kernel_initializer = tf.glorot_uniform_initializer()
self.bias_initializer = tf.truncated_normal_initializer(mean=0.011, stddev=0.005)

Comparing multiple executions with the same input, the resulting values of h0 are consistent, while those of h1 vary between runs.
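For completeness, this is roughly how I compare the two tensors between executions (a sketch; feed_dict stands in for my real input feed, and numpy is assumed to be imported as np):

# Rough sketch of the comparison; feed_dict is a placeholder for my real inputs.
h0_val, h1_val = self.sess.run([self.h0, self.h1], feed_dict=feed_dict)
np.save('h0_run.npy', h0_val)
np.save('h1_run.npy', h1_val)
# Loading the saved arrays from two separate executions and checking them with
# np.array_equal shows h0 matching exactly while h1 differs.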

The way I build the graph and restore the model:

  1. Build the model graph (including, for example, the two variables mentioned above). I create the init op (tf.global_variables_initializer()) but don't run it here (only during training).
  2. Initialize a session.
  3. Load the trained model.
  4. Run ops to get predictions.

The code:

# building network graph
# ...

# restoring trained model
self.saver = tf.train.Saver(max_to_keep=2)
self.sess = tf.Session()
self.saver.restore(self.sess, model_path)

# running network ops (without running tf.global_variables_initializer)
self.sess.run([...])

I manually checked the restored weights (kernel and bias) of h0 and h1 in two separate executions, and they are the same after restoring from the checkpoint.
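(The check was roughly along these lines; the tensor names assume the default kernel/bias naming that tf.layers.dense creates under the 'h1' scope.)

# Sketch of the weight check; variable names assume tf.layers default naming.
graph = tf.get_default_graph()
kernel = graph.get_tensor_by_name('h1/kernel:0')
bias = graph.get_tensor_by_name('h1/bias:0')
kernel_val, bias_val = self.sess.run([kernel, bias])
# Saving these arrays (np.save) and comparing them across executions with
# np.array_equal confirms the checkpoint restores identical values.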

Any ideas what could cause this, or how to handle it so that the executions are deterministic?

P.S. I also tried setting a constant global TensorFlow and NumPy seed. That didn't help.
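(A minimal sketch of the seeding I tried; the seed value itself is arbitrary.)

# Seeding attempt (did not remove the non-determinism):
import random
import numpy as np
import tensorflow as tf

SEED = 42  # arbitrary constant
random.seed(SEED)
np.random.seed(SEED)
tf.set_random_seed(SEED)  # graph-level TensorFlow seed (TF 1.x API)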

** EDIT **


Going systematically over the network layers, I have found that the first non-deterministic op is reduce_sum. Concretely, this line of code:

self.x_weighted = tf.reduce_sum(tf.multiply(tf.expand_dims(self.x_weights_norm, -1), x_outputs), axis=1, name="x_weighted")
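(To isolate the op, I evaluate just this tensor repeatedly on a fixed feed and compare the raw arrays; a rough sketch, with feed_dict standing in for my real inputs.)

# Sketch: evaluate only x_weighted several times and compare the results.
vals = [self.sess.run(self.x_weighted, feed_dict=feed_dict) for _ in range(5)]
print('identical within one run:',
      all(np.array_equal(vals[0], v) for v in vals[1:]))
# Saving vals[0] with np.save and diffing it against another execution shows
# whether the differences appear only across separate runs.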

I saw that this is a known issue - see here and here. Yet this behavior reproduces even on a single CPU, with the number of threads limited to 1, like this:

config = tf.ConfigProto(intra_op_parallelism_threads=1,
                        inter_op_parallelism_threads=1,
                        allow_soft_placement=True,
                        device_count={'CPU': 1})
self.sess = tf.Session(config=config)

Now I wonder whether there is another part that is not set correctly and still introduces randomness, or whether the reduce_sum non-determinism occurs even with this configuration.

Upvotes: 1

Views: 554

Answers (1)

Mega

Reputation: 545

Problem solved. The randomness was due to the use of Python's built-in hash function, applied to the input of the network. Fixing the PYTHONHASHSEED environment variable made the output consistent across executions.
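Note that the variable has to be set before the Python interpreter starts, since hash randomization is fixed at startup; setting os.environ inside the script is too late. A sketch (the script name is just an example):

# Launch the prediction script with the hash seed fixed, e.g.:
#   PYTHONHASHSEED=0 python predict.py
import os

assert os.environ.get('PYTHONHASHSEED') is not None, \
    'launch with PYTHONHASHSEED set so hash() is reproducible'

# With the seed fixed, hash() of the same input string is identical across runs.
print(hash('some input token'))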

Upvotes: 0
