pyxies

Reputation: 404

How to use Automatic Mixed Precision in tensorflow 2.0 with hub.KerasLayer

Following the TensorFlow documentation, I tried to use Automatic Mixed Precision (AMP) in TensorFlow 2.0 with the Keras API. Here is my code:

#!/usr/bin/env python
# coding: utf-8
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow_hub as hub
import tensorflow.keras.mixed_precision.experimental as mixed_precision
import tensorflow.keras.layers as layers
import numpy as np
import tensorflow as tf

# we can use mixed precision with the following line
policy = mixed_precision.Policy('mixed_float16')
# policy = mixed_precision.Policy('float32')
mixed_precision.set_policy(policy)
print('Compute dtype: %s' % policy.compute_dtype)
print('Variable dtype: %s' % policy.variable_dtype)
num_samples = 1024
batch_size = 16
max_seq_len = 128
num_class = 16
epochs = 3
vocab_size = 30522

# BERT_PATH = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1'
BERT_PATH = '../input/bert-base-from-tfhub/bert_en_uncased_L-12_H-768_A-12'


def bert_model():
    input_ids = tf.keras.Input((max_seq_len,), dtype=tf.int32, name='input_ids')
    input_masks = tf.keras.Input((max_seq_len,), dtype=tf.int32, name='input_masks')
    input_segments = tf.keras.Input((max_seq_len,), dtype=tf.int32, name='input_segments')

    bert_layer = hub.KerasLayer(BERT_PATH, trainable=True)

    print('bert_layer._dtype_policy:', bert_layer._dtype_policy)
    print('bert_layer._compute_dtype:', bert_layer._compute_dtype)
    print('bert_layer._dtype:', bert_layer._dtype)

    _, bert_sequence_output = bert_layer([input_ids, input_masks, input_segments])

    print("bert_sequence_output.dtype:", bert_sequence_output.dtype)

    x = layers.GlobalAveragePooling1D()(bert_sequence_output)
    logits = layers.Dense(num_class, name="logits")(x)

    print("logits.dtype:", logits.dtype)

    # when using mixed precision, regardless of what your model ends in, make sure the output is float32.
    output = layers.Activation('sigmoid', dtype='float32', name='output')(logits)
    print('output.dtype:', output.dtype)

    model = tf.keras.models.Model(inputs=[input_ids, input_masks, input_segments], outputs=output)
    return model


# make dummy inputs
train_X = []
train_X.append(np.random.randint(0, vocab_size, size=(num_samples, max_seq_len)))  # ids
train_X.append(np.zeros(shape=(num_samples, max_seq_len)))  # masks
train_X.append(np.zeros(shape=(num_samples, max_seq_len)))  # segments
train_Y = np.random.randn(num_samples, num_class)  # labels

model = bert_model()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
model.compile(loss="binary_crossentropy", optimizer=optimizer)
model.fit(train_X, train_Y, epochs=epochs, verbose=1, batch_size=batch_size)

What I expect:

bert_sequence_output.dtype should be float16, because it is the output of a layer (i.e. the bert_layer) that uses the mixed_float16 policy.
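
For reference, here is a minimal sketch (plain Keras layers, no TF Hub involved) of the behavior I expect from the mixed_float16 policy:

import tensorflow as tf
import tensorflow.keras.mixed_precision.experimental as mixed_precision

# Under mixed_float16, an ordinary layer casts its inputs to float16 and
# produces float16 outputs, while its variables stay float32.
mixed_precision.set_policy(mixed_precision.Policy('mixed_float16'))

inputs = tf.keras.Input((128,), dtype=tf.float32)
dense = tf.keras.layers.Dense(64)
outputs = dense(inputs)

print(outputs.dtype)  # float16: the policy's compute dtype
print(dense.dtype)    # float32: the policy's variable dtype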

But what I actually get:

The code above tells me that bert_sequence_output.dtype is float32. Here is the full log:

ssh://[email protected]:22/home/xiepengyu/miniconda3/envs/tf2/bin/python -u /home/xiepengyu/google_quest/scripts/multi_bert_aug_mixed_precision_test.py
2020-01-05 11:30:50.951010: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-01-05 11:30:51.380306: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/xiepengyu/cuda/cuda-10.1/lib64:$LD_LIBRARY_PATH
2020-01-05 11:30:51.380387: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/xiepengyu/cuda/cuda-10.1/lib64:$LD_LIBRARY_PATH
2020-01-05 11:30:51.380399: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-01-05 11:30:52.292392: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-01-05 11:30:52.635553: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:03:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
2020-01-05 11:30:52.635599: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-01-05 11:30:52.637236: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-01-05 11:30:52.638264: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-01-05 11:30:52.638493: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-01-05 11:30:52.640188: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-01-05 11:30:52.641278: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-01-05 11:30:52.644628: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-05 11:30:52.650678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-01-05 11:30:52.650998: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-01-05 11:30:52.658229: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3499720000 Hz
2020-01-05 11:30:52.658878: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x562d05824cc0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-01-05 11:30:52.658896: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-01-05 11:30:52.871435: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x562d058cb200 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-01-05 11:30:52.871481: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-01-05 11:30:52.875039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:03:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
2020-01-05 11:30:52.875109: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-01-05 11:30:52.875137: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-01-05 11:30:52.875149: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-01-05 11:30:52.875161: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-01-05 11:30:52.875172: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-01-05 11:30:52.875183: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-01-05 11:30:52.875195: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-05 11:30:52.876635: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-01-05 11:30:53.444364: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-01-05 11:30:53.444427: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 
2020-01-05 11:30:53.444436: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N 
2020-01-05 11:30:53.450671: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10392 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)
Compute dtype: float16
Variable dtype: float32
bert_layer._dtype_policy: <Policy "mixed_float16", loss_scale=DynamicLossScale(current_loss_scale=32768.0, num_good_steps=0, initial_loss_scale=32768.0, increment_period=2000, multiplier=2.0)>
bert_layer._compute_dtype: float16
bert_layer._dtype: float32
bert_sequence_output.dtype: <dtype: 'float32'>
logits.dtype: <dtype: 'float16'>
output.dtype: <dtype: 'float32'>
Train on 1024 samples
Epoch 1/3
2020-01-05 11:31:06.079381: W tensorflow/core/common_runtime/shape_refiner.cc:88] Function instantiation has undefined input shape at index: 1161 in the outer inference context.
/home/xiepengyu/miniconda3/envs/tf2/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:433: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2020-01-05 11:31:08.348584: W tensorflow/core/common_runtime/shape_refiner.cc:88] Function instantiation has undefined input shape at index: 1161 in the outer inference context.
2020-01-05 11:31:18.719649: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
1024/1024 [==============================] - 34s 33ms/sample - loss: 0.0720
Epoch 2/3
1024/1024 [==============================] - 15s 15ms/sample - loss: 0.0185
Epoch 3/3
1024/1024 [==============================] - 15s 15ms/sample - loss: 0.0042

Process finished with exit code 0

When I change the policy to float32, those print statements give me the following (the rest of the log is the same as with mixed_float16):

Compute dtype: float32
Variable dtype: float32
bert_layer._dtype_policy: <Policy "float32", loss_scale=None>
bert_layer._compute_dtype: float32
bert_layer._dtype: float32
bert_sequence_output.dtype: <dtype: 'float32'>
logits.dtype: <dtype: 'float32'>
output.dtype: <dtype: 'float32'>

Based on the log, here are my conclusions:

  1. The mixed_float16 policy does work for the other layers, e.g. the Dense layer named "logits", because its output has dtype float16.

  2. The policy of the BERT layer has been set to mixed_float16, but it does not seem to take effect, judging from the fact that bert_sequence_output.dtype is float32. Further evidence: the GPU memory usage (which is dominated by the variables of the BERT layer) is nearly the same in both cases.

Personally, I think this is because the layers defined inside the BERT module have been hard-coded to dtype float32, so the mixed_float16 policy cannot change their behavior. Is that right? What else could have caused the problem, and how can I fix it?
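
The only way I can think of to test this is to load the SavedModel directly (bypassing hub.KerasLayer) and inspect the dtypes baked into its signatures. This is just a sketch: I'm assuming the module exposes a 'serving_default' signature, which may not be true for this particular TF Hub module, so the printed keys should be checked first.

import tensorflow as tf

# Sketch: inspect the SavedModel underneath hub.KerasLayer. If the dtypes
# in its signatures are fixed to float32, the Keras dtype policy cannot
# rewrite them, because the graph was traced at export time.
obj = tf.saved_model.load(BERT_PATH)
print(list(obj.signatures.keys()))  # see which signatures actually exist

# 'serving_default' is an assumption; substitute a name printed above.
sig = obj.signatures.get('serving_default')
if sig is not None:
    for name, tensor in sig.structured_outputs.items():
        print(name, tensor.dtype)  # expecting float32 everywhere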

Thanks in advance for any help!

Upvotes: 2

Views: 1474

Answers (2)

MyungHa Kwon

Reputation: 23

To the best of my knowledge, bert_sequence_output.dtype and output.dtype don't need to be float16 during mixed precision training. You can check this in the document you linked (the TensorFlow docs).

The reason is that some computation results should stay in fp32 to avoid overflow: for example, normalization and softmax can overflow while summing the elements of a large matrix in float16.
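
A quick numpy sketch of that overflow behavior (float16's largest finite value is 65504):

import numpy as np

# float16 overflows quickly: its largest finite value is 65504.
print(np.finfo(np.float16).max)  # 65504.0

# Summing a modest vector already overflows in a float16 accumulator ...
x = np.full(1000, 100.0, dtype=np.float16)
print(x.sum(dtype=np.float16))  # inf, since 1000 * 100 > 65504
print(x.sum(dtype=np.float32))  # 100000.0, fine in float32

# ... and so does the exp() inside softmax for logits above ~11.
print(np.exp(np.float16(12.0)))  # inf, since exp(12) ~ 162755 > 65504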

I think you'd be better off checking the training speed, rather than tensor dtypes, to see whether the policy is working.

And if your GPU is a GTX 1080 Ti, there will be little improvement, as it lacks the Tensor Cores that are designed for fast fp16 computation. It does still support mixed precision training; only the speed differs.
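
For example, something like this rough sketch (reusing bert_model(), the dummy data, and the mixed_precision import from the question) compares the epoch time under each policy:

import time

# Rough sketch: build and train the model once per policy and compare
# epoch times. On GPUs with Tensor Cores (compute capability >= 7.0)
# mixed_float16 should be clearly faster; on a GTX 1080 Ti it won't be.
for name in ['float32', 'mixed_float16']:
    mixed_precision.set_policy(mixed_precision.Policy(name))
    model = bert_model()
    model.compile(loss='binary_crossentropy',
                  optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4))
    model.fit(train_X, train_Y, batch_size=batch_size, verbose=0)  # warm-up epoch
    start = time.perf_counter()
    model.fit(train_X, train_Y, batch_size=batch_size, verbose=0)
    print('%s: %.1f s/epoch' % (name, time.perf_counter() - start))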

Upvotes: 0

Jeff Zhou

Reputation: 21

The GTX 1080 Ti does not support mixed precision training. You need an NVIDIA RTX graphics card; the 2000 series has Tensor Cores and hence supports mixed precision.

Upvotes: 2
