Hamilton

Reputation: 678

Any doable approach to use multiple GPUs, multiple processes with TensorFlow?

I am using a Docker container to run my experiment. I have multiple GPUs available and I want to use all of them for a single program. To do so, I used tf.distribute.MirroredStrategy as suggested on the TensorFlow site, but it is not working. The full error messages are in the linked gist.

here is available GPUs info:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:6A:00.0 Off |                    0 |
| N/A   31C    P8    15W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:6B:00.0 Off |                    0 |
| N/A   31C    P8    15W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            Off  | 00000000:6C:00.0 Off |                    0 |
| N/A   34C    P8    15W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            Off  | 00000000:6D:00.0 Off |                    0 |
| N/A   34C    P8    15W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

My current attempt

Here is my attempt using tf.distribute.MirroredStrategy:

import tensorflow as tf

device_type = "GPU"
devices = tf.config.experimental.list_physical_devices(device_type)
# Strip the "/physical_device:" prefix so the names look like "GPU:0", "GPU:1", ...
devices_names = [d.name.split("e:")[1] for d in devices]
strategy = tf.distribute.MirroredStrategy(devices=devices_names[:3])

with strategy.scope():
    # model and opt are built earlier in the script (not shown here)
    model.compile(optimizer=opt,
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])

The above attempt is not working and gives the error listed in the gist above. I have not found another way of using multiple GPUs for a single experiment.

Does anyone have a workable approach to make this happen? Any thoughts?

Upvotes: 0

Views: 798

Answers (1)

jordanvrtanoski

Reputation: 5517

Is MirroredStrategy the proper way to distribute the workload?

The approach is correct, as long as the GPUs are on the same host. The TensorFlow manual has examples of how tf.distribute.MirroredStrategy can be used with Keras to train on the MNIST dataset.
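
For reference, a minimal sketch of that pattern, assuming TF 2.x and the built-in Keras MNIST loader; the layer sizes and hyper-parameters are illustrative, not taken from the manual:

import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0

strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs by default
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Both model creation and compilation go inside the scope.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

model.fit(x_train, y_train, epochs=2,
          batch_size=64 * strategy.num_replicas_in_sync)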

Is MirroredStrategy the only strategy?

No, there are multiple strategies that can be used to achieve the workload distribution. For example, tf.distribute.MultiWorkerMirroredStrategy can also be used to distribute the work on multiple devices through multiple workers.

The TF documentation explains the strategies and their limitations, and provides some examples to help kick-start the work.
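
As an illustration only, a minimal sketch of MultiWorkerMirroredStrategy, assuming TF 2.4+ and two worker hosts; the host names, ports, and TF_CONFIG values are placeholders, not from the question:

import json
import os

import tensorflow as tf

# TF_CONFIG must be set before the strategy is created; "index" is 1 on the second host.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},
    "task": {"type": "worker", "index": 0},
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Build and compile the model inside the scope, same as with MirroredStrategy.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="adam", loss="mse")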

The strategy is throwing an error

According to the issue on GitHub, the ValueError: SyncOnReadVariable does not support 'assign_add' ... is a bug in TensorFlow which is fixed in TF 2.4.

You can try to upgrade the TensorFlow libraries with:

pip install --ignore-installed --upgrade tensorflow
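
After upgrading, a quick sanity check (shown only as an example) to confirm the installed version:

import tensorflow as tf

print(tf.__version__)  # expect 2.4.0 or newer for the fix mentioned above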

Implementing variables that are not aware of the distributed strategy

If you have tried the standard example from the documentation and it works fine, but your model does not, you might have variables that are incorrectly set up, or you may be using distributed variables that do not support the aggregation functions required by the distributed strategy.

As per the TF documentation:

..." A distributed variable is variables created on multiple devices. As discussed in the glossary, mirrored variable and SyncOnRead variable are two examples. "...

To better understand how to implement custom support for distributed variables, check the following page in the documentation.
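
As an illustration only (the variable name and values are made up, not from your code), a variable created under the strategy scope with an explicit synchronization and aggregation mode, so that assign_add has a defined cross-replica behaviour:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    counter = tf.Variable(
        0.0,
        synchronization=tf.VariableSynchronization.ON_READ,
        aggregation=tf.VariableAggregation.SUM,  # how replica values are combined
    )

def replica_fn():
    counter.assign_add(1.0)  # runs once per replica

strategy.run(replica_fn)
print(counter.read_value())  # reading outside the replica context sums across replicas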

Upvotes: 1
