Reputation: 390
I have been wanting to increase my batch size to improve the generalization of my model (it's very batch-size sensitive). The solution is to go multi-GPU in order to utilize more memory. I am using tensorflow.keras (with tensorflow 2.1 on Windows 10) in my script, and followed the instructions for configuring a mirrored strategy for my model. The issue is that my training script runs perfectly fine without the mirrored strategy code, but with the mirrored strategy I get an error regarding NCCL. This looks to be the exact same issue as:
https://github.com/tensorflow/tensorflow/issues/21470
Unfortunately, the solution discussed in that link:
cross_tower_ops = tf.contrib.distribute.AllReduceCrossDeviceOps(
    'hierarchical_copy', num_packs=num_gpus)
strategy = tf.contrib.distribute.MirroredStrategy(cross_tower_ops=cross_tower_ops)
does not work with tf 2.1, since the 'contrib' portion of tf appears to have been removed. Does anyone know the replacement fix for NCCL on Windows, or what replaces the removed 'contrib' functionality?
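For reference, the mirrored strategy configuration I am following is essentially the standard tf.keras pattern from the distributed training guide (the model below is just a placeholder to show where the strategy scope goes):
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print('Number of devices:', strategy.num_replicas_in_sync)

# Model building and compiling must happen inside the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer='adam', loss='mse')

# model.fit(...) is then called as usual; each batch is split across the GPUs
# and the gradients are combined via the cross-device reduce.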
Upvotes: 3
Views: 3280
Reputation: 26335
In my experience, some cross_device_ops options do not work and produce errors.
This one was meant for the NVIDIA DGX-1 architecture and might underperform on other hardware:
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
This one should work:
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.ReductionToOneDevice())
This one did not work with my configuration:
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.NcclAllReduce())
So it is advisable to try the different options.
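Since the contrib snippet quoted in the question used 'hierarchical_copy', its closest tf 2.x counterpart would be along these lines (num_packs is carried over from the quoted snippet and num_gpus is a placeholder; whether packing helps depends on your hardware):
import tensorflow as tf

num_gpus = 2  # however many GPUs you are mirroring across
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce(num_packs=num_gpus))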
Upvotes: 1
Reputation: 1794
One solution from issue 21470 is to build NCCL for Windows x64. MyCaffe provides instructions for that here: https://github.com/MyCaffe/NCCL/blob/master/INSTALL.md
You'll need VS 2015/2017 and the CUDA development package, and once it is compiled you'll need to put the produced .dlls in the correct location.
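Once the DLLs are in place, a quick sanity check (just a sketch, not part of the linked instructions) is to confirm TensorFlow sees both GPUs and can set up the NCCL-based strategy; the real test is still running an actual training step:
import tensorflow as tf

# Both GPUs should be listed here before bothering with NCCL.
print(tf.config.list_physical_devices('GPU'))

# With a working nccl.dll on the path, the NCCL all-reduce can be
# selected explicitly instead of falling back to another reduction.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.NcclAllReduce())
print('Replicas in sync:', strategy.num_replicas_in_sync)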
Upvotes: 1