isarandi

Reputation: 3349

MirroredVariable has different values on replicas (zeros, except on one device)

Minimal example to demonstrate the problem:

import tensorflow as tf

with tf.distribute.MirroredStrategy().scope():
    print(tf.Variable(1.))

Output on a 4-GPU server:

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
MirroredVariable:{
  0: <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>,
  1: <tf.Variable 'Variable/replica_1:0' shape=() dtype=float32, numpy=0.0>,
  2: <tf.Variable 'Variable/replica_2:0' shape=() dtype=float32, numpy=0.0>,
  3: <tf.Variable 'Variable/replica_3:0' shape=() dtype=float32, numpy=0.0>
}

The problem, as seen above, is that the replicas do not all hold the correct variable value: every replica except the one on the first device reads zero (the numpy=0.0 parts). The same thing happens with 2 or 3 visible devices, not just with all 4.

The same code does produce the expected behavior on a different machine.

Correct output on a different, 2-GPU workstation:

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
MirroredVariable:{
  0: <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>,
  1: <tf.Variable 'Variable/replica_1:0' shape=() dtype=float32, numpy=1.0>
}

(Note the value 1.0 on both devices)


The problematic machine is a Dell PowerEdge R750xa with 4x Nvidia A40 GPUs.

The correctly working machine has 2x Titan RTX.

Software config on both:

What could be the reason for this behavior? I'm glad to provide more details.
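To make the failure easier to check programmatically (rather than reading the printed repr), the per-replica components of a MirroredVariable can be unpacked with strategy.experimental_local_results. A minimal diagnostic sketch, assuming TensorFlow 2.x; on a healthy setup every entry should equal 1.0, while on the broken machine all but the first would be 0.0:

```python
import tensorflow as tf

# Create a mirrored variable exactly as in the minimal example above.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    v = tf.Variable(1.0)

# experimental_local_results returns a tuple with one component per replica.
per_replica = strategy.experimental_local_results(v)
values = [float(t.numpy()) for t in per_replica]
print(values)

# On a correctly working machine, all replicas mirror the initial value.
assert all(val == 1.0 for val in values), f"replica values diverged: {values}"
```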

Upvotes: 3

Views: 318

Answers (1)

go_krm

Reputation: 1

This is not a full answer, but since I cannot add a comment I am posting it here in the hope that it helps. I have a similar issue, and my question has been unanswered for a while: MirroredStrategy output varies depending on the visible GPUs

As a workaround for now, I think you can run nvidia-smi topo -m and check the connections between the GPUs.

In my case, your example works fine with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 but fails with CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7,0
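The check and workaround above can be sketched as a couple of shell commands. This is a configuration fragment, not a fix; the script name train.py is a hypothetical stand-in for whatever program creates the MirroredStrategy:

```shell
# Inspect the interconnect topology between GPUs (NVLink/PCIe paths):
nvidia-smi topo -m

# Workaround reported above: keep GPU 0 first in the visible-device order.
export CUDA_VISIBLE_DEVICES=0,1,2,3
python train.py  # hypothetical training script
```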

Upvotes: 0
