tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '1' but visible device count is 1

Question

I am trying to use multiple GPUs with Horovod for distributed training. I first tested a simple convolutional neural network on one GPU and on two GPUs, and both runs worked. I then combined a CNN with an LSTM. That model runs fine on a single GPU, but it fails when I run it on two GPUs. The complete traceback is as follows:

[1,1]:Traceback (most recent call last):
[1,1]:  File "horovod-PAMAP2.py", line 30, in <module>
[1,1]:    K.set_session(tf.compat.v1.Session(config=config))
[1,1]:  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1586, in __init__
[1,1]:    super(Session, self).__init__(target, graph, config=config)
[1,1]:  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 701, in __init__
[1,1]:    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
[1,1]:tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '1' but visible device count is 1
[1,0]:2022-08-14 19:21:07.506043: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5fb58a0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
[1,0]:2022-08-14 19:21:07.506129: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 2080 Ti, Compute Capability 7.5
[1,0]:2022-08-14 19:21:07.507537: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
[1,0]:pciBusID: 0000:18:00.0 name: NVIDIA GeForce RTX 2080 Ti computeCapability: 7.5
[1,0]:coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
[1,0]:2022-08-14 19:21:07.515280: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
[1,0]:2022-08-14 19:21:07.534658: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
[1,0]:2022-08-14 19:21:07.670187: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[7328,1],1]
  Exit code:    1
--------------------------------------------------------------------------

Here is the command I am using to run my script on the two GPUs:

 /usr/local/bin/horovodrun -np 2 -H localhost:2 /usr/bin/python3 horovod-PAMAP2.py 256
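
To narrow this down, a small check like the one below could be run in each process before the session is created. It is only a sketch, not code from my script: the helper names `visible_gpu_ids` and `check_local_rank` are made up for illustration, and it assumes the script sets `config.gpu_options.visible_device_list` from `hvd.local_rank()`, which is the usual Horovod pattern. The idea is that the error text suggests rank 1 sees only one GPU, so its local rank (1) does not correspond to any visible device index.

```python
import os


def visible_gpu_ids(env=None):
    """Parse CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset (no restriction applied),
    otherwise the list of device tokens it contains.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def check_local_rank(local_rank, gpu_count):
    """Return the visible_device_list string for this rank.

    Raises ValueError when the local rank has no matching GPU, which is
    the situation the InvalidArgumentError above describes.
    """
    if not 0 <= local_rank < gpu_count:
        raise ValueError(
            f"local rank {local_rank} has no matching GPU: "
            f"only {gpu_count} device(s) visible to this process"
        )
    return str(local_rank)


# Hypothetical use inside the training script (assumes Horovod and
# TensorFlow are imported as hvd and tf, and hvd.init() has been called):
#
#   print(hvd.local_rank(), visible_gpu_ids())
#   gpus = tf.config.experimental.list_physical_devices("GPU")
#   config.gpu_options.visible_device_list = check_local_rank(
#       hvd.local_rank(), len(gpus))
```

If `visible_gpu_ids()` prints a single entry on rank 1, something in the environment (a shell profile, a container setting, or the job launcher) is probably restricting the devices before TensorFlow starts.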

Answers (0)
