I am trying to use multiple GPUs with Horovod for distributed training. I first tested a simple convolutional neural network on one GPU and on two GPUs, and both runs worked. I then combined a CNN with an LSTM. That model trains fine on a single GPU, but it fails when I run it on two GPUs. The complete traceback is as follows:
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>: File "horovod-PAMAP2.py", line 30, in <module>
[1,1]<stderr>: K.set_session(tf.compat.v1.Session(config=config))
[1,1]<stderr>: File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1586, in __init__
[1,1]<stderr>: super(Session, self).__init__(target, graph, config=config)
[1,1]<stderr>: File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 701, in __init__
[1,1]<stderr>: self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
[1,1]<stderr>:tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '1' but visible device count is 1
[1,0]<stderr>:2022-08-14 19:21:07.506043: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5fb58a0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
[1,0]<stderr>:2022-08-14 19:21:07.506129: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA GeForce RTX 2080 Ti, Compute Capability 7.5
[1,0]<stderr>:2022-08-14 19:21:07.507537: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
[1,0]<stderr>:pciBusID: 0000:18:00.0 name: NVIDIA GeForce RTX 2080 Ti computeCapability: 7.5
[1,0]<stderr>:coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
[1,0]<stderr>:2022-08-14 19:21:07.515280: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
[1,0]<stderr>:2022-08-14 19:21:07.534658: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
[1,0]<stderr>:2022-08-14 19:21:07.670187: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[7328,1],1]
Exit code: 1
--------------------------------------------------------------------------
Here is the command I use to run the script on two GPUs:
/usr/local/bin/horovodrun -np 2 -H localhost:2 /usr/bin/python3 horovod-PAMAP2.py 256
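The error on rank 1 says that the process asked for GPU id '1' while only one device was visible to it. Below is a minimal diagnostic sketch, not part of the original script. It assumes the script pins each process to a GPU by setting `visible_device_list` to the Horovod local rank (the usual pattern, but the script itself is not shown here) and that the visible device count may be limited by the `CUDA_VISIBLE_DEVICES` environment variable. The helper names `count_visible_gpus` and `check_local_rank` are hypothetical, and the count is a rough one based only on the comma-separated entries.

```python
import os


def count_visible_gpus(env=None):
    """Roughly count the GPU entries listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset, in which case CUDA exposes
    every GPU on the machine and this check cannot say anything.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    # Count the non-empty comma-separated entries (an empty string means 0).
    return len([entry for entry in value.split(",") if entry.strip()])


def check_local_rank(local_rank, visible_count):
    """Return a warning string if local_rank cannot map to a visible GPU.

    Assumption: the training script uses str(local_rank) as its
    visible_device_list, so local_rank must be < visible_count.
    """
    if visible_count is not None and local_rank >= visible_count:
        return (
            "local rank %d would request GPU id '%d', but only %d "
            "device(s) are visible" % (local_rank, local_rank, visible_count)
        )
    return None


if __name__ == "__main__":
    # In the real script the rank would come from hvd.local_rank();
    # here it is hard-coded to 1 to mirror the failing process.
    print(check_local_rank(1, count_visible_gpus()))
```

Running this in the same shell that launches `horovodrun` shows whether `CUDA_VISIBLE_DEVICES` is restricting the processes to a single GPU. If it prints `None`, the variable is either unset or lists enough devices, and the cause lies elsewhere.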