HY G

Reputation: 305

Does TensorFlow 0.10.0rc support float16?

In order to reduce the size of the tensors, I defined all the variables with dtype=tf.float16 in my model, and then defined the optimizer:

optimizer = tf.train.AdamOptimizer(self.learning_rate)
self.compute_gradients = optimizer.compute_gradients(self.mean_loss_reg)
train_adam_op = optimizer.apply_gradients(self.compute_gradients, global_step=self.global_step)

Everything works OK, but after I run train_adam_op, the gradients and variables are NaN in Python. I wonder whether the apply_gradients() API supports the tf.float16 type. Why do I get NaN after apply_gradients() is called by session.run()?

Upvotes: 1

Views: 901

Answers (1)

Benoit Steiner

Reputation: 1469

The dynamic range of fp16 is fairly limited compared to that of 32-bit floats. As a result, it's pretty easy to overflow or underflow them, which often results in the NaN that you've encountered.
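As a rough reference (a NumPy sketch rather than TensorFlow, but the float16 format is the same): fp16 tops out around 6.5e4 and loses everything below roughly 6e-8, and an overflowed value can turn into NaN as soon as it is combined with another inf.

import numpy as np

print(np.finfo(np.float16).max)           # 65504.0, the largest finite fp16
print(np.float16(7e4))                    # inf: overflow
print(np.float16(1e-8))                   # 0.0: underflow
print(np.float16(7e4) - np.float16(7e4))  # nan: inf - inf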

You can insert a few check_numerics operations in your model to help pinpoint the specific operation(s) that become unstable when performed in fp16.

For example, you can wrap an L2 loss operation as follows to check that its result fits in an fp16:

A = tf.nn.l2_loss(some_tensor)

becomes

A = tf.check_numerics(tf.nn.l2_loss(some_tensor), "found the root cause")
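A minimal end-to-end sketch of the idea (some_tensor and its values are made up here, chosen so the squared sum overflows fp16; it also assumes the check_numerics and l2_loss kernels accept float16 in your build):

import tensorflow as tf

# Hypothetical input: 3e4 is representable in fp16, but its square (9e8) is not.
some_tensor = tf.constant([3e4, 3e4, 3e4], dtype=tf.float16)
A = tf.check_numerics(tf.nn.l2_loss(some_tensor), "found the root cause")

with tf.Session() as sess:
    # If the loss overflows to inf, this raises an InvalidArgumentError whose
    # message contains "found the root cause", instead of silently propagating
    # inf/nan through the rest of the graph.
    sess.run(A)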

The most common sources of overflows and underflows are exp() and log(), as well as the various classification primitives, so I would start looking there.

Once you've figured out which sequence of operations is problematic, you can update your model to perform that sequence in 32-bit floats: use tf.cast() to convert its inputs to 32-bit floats, then cast the result back to fp16.
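As an illustration only (unstable_part is a made-up placeholder for whatever sequence you identified; here it is a naive softmax, which overflows easily in fp16 because of the exp()):

import tensorflow as tf

def unstable_part(x):
    # Naive softmax: exp(14) is already ~1.2e6, well past the fp16 maximum.
    e = tf.exp(x)
    return e / tf.reduce_sum(e)

logits_fp16 = tf.constant([10.0, 12.0, 14.0], dtype=tf.float16)

# Perform just the problematic sequence in 32-bit precision...
probs_fp32 = unstable_part(tf.cast(logits_fp16, tf.float32))

# ...then cast the result back to fp16 so the rest of the model is unchanged.
probs_fp16 = tf.cast(probs_fp32, tf.float16)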

Upvotes: 4
