Reputation: 678
I am using a Docker container to run my experiment. I have multiple GPUs available and I want to use all of them for a single program. To do so, I used tf.distribute.MirroredStrategy
as suggested on the TensorFlow site, but it is not working. The full error messages are in the gist.
Here is the available GPU info:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:6A:00.0 Off | 0 |
| N/A 31C P8 15W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:6B:00.0 Off | 0 |
| N/A 31C P8 15W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 Off | 00000000:6C:00.0 Off | 0 |
| N/A 34C P8 15W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 Off | 00000000:6D:00.0 Off | 0 |
| N/A 34C P8 15W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
My current attempt
Here is my attempt using tf.distribute.MirroredStrategy:
import tensorflow as tf

# `model` and `opt` are defined earlier in my script
device_type = "GPU"
devices = tf.config.experimental.list_physical_devices(device_type)
# Physical device names look like "/physical_device:GPU:0"; keep only the "GPU:0" part
devices_names = [d.name.split("e:")[1] for d in devices]

strategy = tf.distribute.MirroredStrategy(devices=devices_names[:3])
with strategy.scope():
    model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"])
The above attempt is not working and gives the error listed in the gist above. I have not found another way of using multiple GPUs for a single experiment.
Does anyone have a workable approach to make this happen? Any thoughts?
Upvotes: 0
Views: 798
Reputation: 5517
The approach is correct, as long as the GPUs are on the same host. The TensorFlow manual has examples of how tf.distribute.MirroredStrategy
can be used with Keras to train on the MNIST dataset.
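For reference, a minimal sketch along the lines of that Keras/MNIST example (the model architecture, batch size, and epoch count here are illustrative assumptions, not the exact code from the manual):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0

with strategy.scope():
    # Both model creation and compilation belong inside the scope.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

model.fit(x_train, y_train,
          batch_size=64 * strategy.num_replicas_in_sync,
          epochs=2)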
MirroredStrategy is not the only option: there are multiple strategies that can be used to achieve the workload distribution. For example, tf.distribute.MultiWorkerMirroredStrategy
can also be used to distribute the work across multiple devices through multiple workers (see the sketch below).
The TF documentation explains the strategies, the limitations associated with them, and provides some examples to help kick-start the work.
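A rough sketch of that multi-worker variant, assuming each worker exports a suitable TF_CONFIG before the script runs (the host names, port, and worker index below are placeholders):

import json
import os

import tensorflow as tf

# Hypothetical cluster description; each worker sets its own "index".
os.environ.setdefault("TF_CONFIG", json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},
    "task": {"type": "worker", "index": 0},
}))

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="adam", loss="mse")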
According to the issue on GitHub, the ValueError: SyncOnReadVariable does not support 'assign_add' ...
is a bug in TensorFlow that was fixed in TF 2.4.
You can try to upgrade the TensorFlow libraries with
pip install --ignore-installed --upgrade tensorflow
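After upgrading, you can confirm the version you actually ended up with (a quick sanity check, not specific to this bug):

python -c "import tensorflow as tf; print(tf.__version__)"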
If you have tried the standard example from the documentation and it works fine, but your model does not, you may have variables that are set up incorrectly, or you may be using distributed variables
that do not support the aggregation functions required by the distribution strategy.
As per the TF documentation:
..." A distributed variable is variables created on multiple devices. As discussed in the glossary, mirrored variable and SyncOnRead variable are two examples. "...
To better understand how to implement custom support for distributed variables, check the following page in the documentation.
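As an illustration of what explicit support can look like, here is a minimal sketch (my own example, not code from the linked page) of a SyncOnRead-style variable created inside a MirroredStrategy scope with explicit synchronization and aggregation, so that per-replica assign_add updates are well defined:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # ON_READ + SUM is the typical setup for metric-like accumulators.
    counter = tf.Variable(
        0.0,
        synchronization=tf.VariableSynchronization.ON_READ,
        aggregation=tf.VariableAggregation.SUM,
    )

@tf.function
def step():
    # Updated per replica; summed when read back in cross-replica context.
    counter.assign_add(1.0)

strategy.run(step)
print(counter.read_value())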
Upvotes: 1