Sanyo

Reputation: 71

MirroredStrategy without NCCL

I would like to use MirroredStrategy to use multiple GPUs in the same machine. I tried one of the examples: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/distribute/python/examples/simple_tfkeras_example.py

The result is: ValueError: Op type not registered 'NcclAllReduce' in binary running on RAID. Make sure the Op and Kernel are registered in the binary running in this process. while building NodeDef 'NcclAllReduce'

I am using Windows, so NCCL is not available. Is it possible to force TensorFlow not to use this library?

Upvotes: 5

Views: 3105

Answers (1)

Austin

Reputation: 862

There are some binaries for NCCL on Windows, but they can be quite annoying to deal with.

As an alternative, TensorFlow gives you three other cross-device reduction options for MirroredStrategy that work natively on Windows: Hierarchical Copy, Reduce to First GPU, and Reduce to CPU. Hierarchical Copy is most likely what you want, but you can benchmark each of them to see which gives you the best result.

If you are using a TensorFlow version older than 2.0, use tf.contrib.distribute:

# Hierarchical Copy (num_packs is typically the number of GPUs)
cross_tower_ops = tf.contrib.distribute.AllReduceCrossTowerOps(
    'hierarchical_copy', num_packs=number_of_gpus)
strategy = tf.contrib.distribute.MirroredStrategy(cross_tower_ops=cross_tower_ops)

# Reduce to First GPU
cross_tower_ops = tf.contrib.distribute.ReductionToOneDeviceCrossTowerOps()
strategy = tf.contrib.distribute.MirroredStrategy(cross_tower_ops=cross_tower_ops)

# Reduce to CPU
cross_tower_ops = tf.contrib.distribute.ReductionToOneDeviceCrossTowerOps(
    reduce_to_device="/device:CPU:0")
strategy = tf.contrib.distribute.MirroredStrategy(cross_tower_ops=cross_tower_ops)
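
In case it helps, here is a minimal sketch (not part of the original answer) of how one of these strategies can then be consumed in TF 1.x via the Estimator API; my_model_fn is a hypothetical model function you would define yourself, and RunConfig takes the strategy through its train_distribute argument:

import tensorflow as tf

# Build the strategy with Windows-friendly hierarchical copy (2 GPUs assumed)
cross_tower_ops = tf.contrib.distribute.AllReduceCrossTowerOps(
    'hierarchical_copy', num_packs=2)
strategy = tf.contrib.distribute.MirroredStrategy(cross_tower_ops=cross_tower_ops)

# Hand the strategy to an Estimator through RunConfig
run_config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=run_config)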

From TensorFlow 2.0 onward, you only need tf.distribute. Here is an example that sets up an Xception model on 2 GPUs:

import tensorflow as tf
from tensorflow.keras.applications import Xception

strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"],
                                          cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
with strategy.scope():
    # Variables created inside the scope are mirrored across both GPUs
    parallel_model = Xception(weights=None,
                              input_shape=(299, 299, 3),
                              classes=number_of_classes)
    parallel_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
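
For completeness, a short sketch of the TF 2.x equivalents of the other two options; tf.distribute.ReductionToOneDevice replaces the old ReductionToOneDeviceCrossTowerOps:

# Reduce to First GPU: by default, reduces to the first destination device
strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1"],
    cross_device_ops=tf.distribute.ReductionToOneDevice())

# Reduce to CPU: reduce on the CPU, then broadcast back to the GPUs
strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1"],
    cross_device_ops=tf.distribute.ReductionToOneDevice(reduce_to_device="/device:CPU:0"))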

Upvotes: 5
