Reputation: 33
I have two GPUs and want to try some distributed training (model parallelism) in TensorFlow.
The two GPUs are:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: TITAN Xp COLLECTORS EDITION, pci bus id: 0000:04:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: TITAN X (Pascal), pci bus id: 0000:82:00.0, compute capability: 6.1
My plan is to divide LeNet into two parts and assign each part to one GPU.
LeNet has 5 layers: I use with tf.device('/gpu:0'): to assign layer 1 to GPU 0, and with tf.device('/gpu:1'): to assign layers 2-5 to GPU 1.
I know there is no need for model parallelism in a model this small; I just want to try model parallelism out on a small model first.
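Roughly, my placement code looks like this (a simplified sketch; the real network also builds the fully connected layers 3-5 inside the second device block):

    import tensorflow as tf

    # Layer 1 (conv1) is pinned to the first GPU.
    with tf.device('/gpu:0'):
        with tf.name_scope('layer1'):
            x = tf.placeholder(tf.float32, [None, 32, 32, 1])
            conv1_w = tf.Variable(tf.truncated_normal([5, 5, 1, 6], stddev=0.1), name='conv1_w')
            conv1_b = tf.Variable(tf.zeros([6]), name='conv1_b')
            conv1 = tf.nn.relu(tf.nn.conv2d(x, conv1_w, strides=[1, 1, 1, 1], padding='VALID') + conv1_b)

    # Layers 2-5 are pinned to the second GPU (only layer 2 shown here).
    with tf.device('/gpu:1'):
        with tf.name_scope('layer2'):
            pool1 = tf.nn.max_pool(conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='VALID')
            conv2_w = tf.Variable(tf.truncated_normal([5, 5, 6, 16], stddev=0.1), name='conv2_w')
            conv2_b = tf.Variable(tf.zeros([16]), name='conv2_b')
            conv2 = tf.nn.relu(tf.nn.conv2d(pool1, conv2_w, strides=[1, 1, 1, 1], padding='VALID') + conv2_b)
        # layer3-layer5 (the fully connected layers) follow here in the same block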
The device-mapping log shows that all the ops have been assigned to the devices I intended:
layer5/fc3_b: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:1
layer5/fc3_b/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:1
layer5/fc3_b/Assign: (Assign): /job:localhost/replica:0/task:0/device:GPU:1
layer5/fc3_w: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:1
layer5/fc3_w/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:1
layer5/truncated_normal/TruncatedNormal: (TruncatedNormal): /job:localhost/replica:0/task:0/device:GPU:1
layer5/truncated_normal/mul: (Mul): /job:localhost/replica:0/task:0/device:GPU:1
layer5/truncated_normal: (Add): /job:localhost/replica:0/task:0/device:GPU:1
layer5/fc3_w/Assign: (Assign): /job:localhost/replica:0/task:0/device:GPU:1
layer4/fc2_b: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:1
layer4/fc2_b/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:1
layer4/fc2_b/Assign: (Assign): /job:localhost/replica:0/task:0/device:GPU:1
layer4/fc2_w: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:1
layer4/fc2_w/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:1
layer4/truncated_normal/TruncatedNormal: (TruncatedNormal): /job:localhost/replica:0/task:0/device:GPU:1
layer4/truncated_normal/mul: (Mul): /job:localhost/replica:0/task:0/device:GPU:1
layer4/truncated_normal: (Add): /job:localhost/replica:0/task:0/device:GPU:1
layer4/fc2_w/Assign: (Assign): /job:localhost/replica:0/task:0/device:GPU:1
layer3/fc1_b: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:1
layer3/fc1_b/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:1
layer3/fc1_b/Assign: (Assign): /job:localhost/replica:0/task:0/device:GPU:1
layer3/fc1_w: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:1
layer3/fc1_w/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:1
layer3/truncated_normal/TruncatedNormal: (TruncatedNormal): /job:localhost/replica:0/task:0/device:GPU:1
layer3/truncated_normal/mul: (Mul): /job:localhost/replica:0/task:0/device:GPU:1
layer3/truncated_normal: (Add): /job:localhost/replica:0/task:0/device:GPU:1
layer3/fc1_w/Assign: (Assign): /job:localhost/replica:0/task:0/device:GPU:1
layer2/conv2_b: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:1
layer2/conv2_b/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:1
layer2/conv2_b/Assign: (Assign): /job:localhost/replica:0/task:0/device:GPU:1
layer2/conv2_w: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:1
layer2/conv2_w/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:1
layer2/truncated_normal/TruncatedNormal: (TruncatedNormal): /job:localhost/replica:0/task:0/device:GPU:1
layer2/truncated_normal/mul: (Mul): /job:localhost/replica:0/task:0/device:GPU:1
layer2/truncated_normal: (Add): /job:localhost/replica:0/task:0/device:GPU:1
layer2/conv2_w/Assign: (Assign): /job:localhost/replica:0/task:0/device:GPU:1
init/NoOp_1: (NoOp): /job:localhost/replica:0/task:0/device:GPU:1
layer1/conv1_b: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:0
layer1/conv1_b/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
layer1/conv1_b/Assign: (Assign): /job:localhost/replica:0/task:0/device:GPU:0
layer1/conv1_w: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:0
layer1/conv1_w/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
layer1/truncated_normal/TruncatedNormal: (TruncatedNormal): /job:localhost/replica:0/task:0/device:GPU:0
layer1/truncated_normal/mul: (Mul): /job:localhost/replica:0/task:0/device:GPU:0
layer1/truncated_normal: (Add): /job:localhost/replica:0/task:0/device:GPU:0
layer1/conv1_w/Assign: (Assign): /job:localhost/replica:0/task:0/device:GPU:0
But I got a different result in timeline.json, as shown in the figure below.
The timeline suggests that the ops of layers 2-5 are launched on GPU 1 but actually run on GPU 0, which is not what I intended by using with tf.device('/gpu:1').
Is this expected in TensorFlow?
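For reference, this is roughly how I generate timeline.json (standard tf.RunOptions/Timeline usage; train_op stands in for my actual training op):

    import tensorflow as tf
    from tensorflow.python.client import timeline

    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()

    with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
        sess.run(tf.global_variables_initializer())
        # train_op is a placeholder for my actual training step.
        sess.run(train_op, options=run_options, run_metadata=run_metadata)

    # Convert the collected step stats into the Chrome trace format
    # that chrome://tracing (and the timeline figure above) displays.
    tl = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(tl.generate_chrome_trace_format())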
This is the first time I have asked a question on Stack Overflow; if any other information is needed, please let me know. Thanks.
Upvotes: 3
Views: 1541
Reputation: 56
This is just an artifact of the Chrome Trace Event Format.
The stream "/job:localhost/replica:0/task:0/device:GPU:0 Compute" shows the time taken to launch/queue the CUDA kernels for ops being executed on GPU:0.
The stream "/job:localhost/replica:0/task:0/device:GPU:1 Compute" shows the time taken to launch/queue the CUDA kernels for ops being executed on GPU:1.
All the streams matching "/device:GPU:0/stream.* Compute" show the time to actually execute the ops on the GPUs. To find out which GPU an op was actually executed on, you need to look at the streams matching "/job:localhost/replica:0/task:0/device:GPU:.* Compute".
Hope this answers your question.
Upvotes: 4