Reputation: 51
I am trying to run a Keras script on a GPU node within a cluster. The node has 4 GPUs, and I made sure all 4 are available for my use. I run the code below to let TensorFlow use the GPUs:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
            logical_gpus = tf.config.list_logical_devices('GPU')
            print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        print(e)
The 4 available GPUs get listed in the output. However, I get the following error when running the code:
Traceback (most recent call last):
File "/BayesOptimization.py", line 20, in <module>
logical_gpus = tf.config.experimental.list_logical_devices('GPU')
File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/framework/config.py", line 439, in list_logical_devices
return context.context().list_logical_devices(device_type=device_type)
File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1368, in list_logical_devices
self.ensure_initialized()
File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 511, in ensure_initialized
config_str = self.config.SerializeToString()
File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1015, in config
gpu_options = self._compute_gpu_options()
File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1074, in _compute_gpu_options
raise ValueError("Memory growth cannot differ between GPU devices")
ValueError: Memory growth cannot differ between GPU devices
Shouldn't the code list all the available GPUs and set memory growth to True for each one?
I am currently using Python 3.9.7 with the following TensorFlow packages:
tensorflow 2.4.1 gpu_py39h8236f22_0
tensorflow-base 2.4.1 gpu_py39h29c2da4_0
tensorflow-estimator 2.4.1 pyheb71bc4_0
tensorflow-gpu 2.4.1 h30adc30_0
Any idea what the problem is and how to solve it? Thanks in advance!
Upvotes: 5
Views: 5806
Reputation: 525
I think it's just a matter of being careful with indentation. For me, your code does not work either:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
            logical_gpus = tf.config.list_logical_devices('GPU')
            print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        print(e)
BUT if you wait to call logical_gpus = tf.config.list_logical_devices('GPU') until after the loop, it works. So change it to:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        print(e)
I think what is happening is that you need to set memory growth on ALL VISIBLE GPUs in the config before calling any other accessors of the config. So your other option is to call tf.config.set_visible_devices FIRST. If you limit TensorFlow to one GPU, you don't need the loop; if you use two or more, you need to be careful to call set_memory_growth on all of them before accessing the config or running any TensorFlow code.
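For example, a minimal sketch of the single-GPU variant could look like the code below (restricting to the first GPU is just an illustrative choice; pick whichever device you need):
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Restrict TensorFlow to the first physical GPU only.
        tf.config.set_visible_devices(gpus[0], 'GPU')
        # With a single visible GPU, memory growth only needs to be set once.
        tf.config.experimental.set_memory_growth(gpus[0], True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Visible devices and memory growth must be set before the GPUs are initialized
        print(e)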
Upvotes: 0
Reputation: 1
On a multi-GPU machine, the memory growth setting must be the same across all available GPUs: either set it to True for every GPU or leave it as False for every GPU.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
Upvotes: 0
Reputation: 21
Try setting os.environ["CUDA_VISIBLE_DEVICES"] = "0" instead of calling tf.config.experimental.set_memory_growth. This restricts TensorFlow to a single GPU, and it works for me.
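For reference, a minimal sketch of this approach (the "0" is just an example device index):
import os

# Make only the first GPU visible to TensorFlow / CUDA.
# The variable must be set before TensorFlow initializes its GPU devices,
# so the safest place is before the tensorflow import.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf

print(tf.config.list_physical_devices('GPU'))  # should now report a single GPU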
Upvotes: 2