JonB

Reputation: 93

Verify TensorFlow is maximizing use of NVIDIA GPU

I recently set up an Ubuntu 19.04 machine with a single NVIDIA GeForce GTX 1060 6GB card. This is my first experience using all the deep learning stuff in Linux. I followed several different instructional blogs to get what I believe is a functional CUDA/Python environment. My next step was to run some sample "large jobs" and verify I am getting the expected performance.

I came across this link - https://medium.com/@andriylazorenko/tensorflow-performance-test-cpu-vs-gpu-79fcd39170c. It seems to have example/tutorial code that will exercise my system, along with some benchmark performance stats.

When I run python cifar10_train.py from tutorials/image/cifar10 as instructed, I see the following:

Every 1.0s: nvidia-smi                                    prospector: Wed Sep 25 06:33:56 2019

Wed Sep 25 06:33:56 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  On   | 00000000:41:00.0 Off |                  N/A |
| 33%   33C    P2    31W / 120W |   6034MiB /  6075MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    101339      C   python                                      6023MiB |
+-----------------------------------------------------------------------------+

So it appears the job is running on the GPU and is using all available memory (more on the memory figure after the log below). However, this test also pins all 24 threads of my Threadripper 1920X, and the GPU temp only increases by a few degrees, with the fan never really going past 33%. Some relevant output from the Python script:

2019-09-25 06:33:30.860034: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5702 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:41:00.0, compute capability: 6.1)
2019-09-25 06:33:30.862049: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55f63af4ce80 executing computations on platform CUDA. Devices:
2019-09-25 06:33:30.862071: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1060 6GB, Compute Capability 6.1
2019-09-25 06:33:30.951617: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
I0925 06:33:31.438272 140501771114304 session_manager.py:500] Running local_init_op.
I0925 06:33:31.456041 140501771114304 session_manager.py:502] Done running local_init_op.
I0925 06:33:31.804769 140501771114304 basic_session_run_hooks.py:606] Saving checkpoints for 0 into /tmp/cifar10_train/model.ckpt.
2019-09-25 06:33:32.217337: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2019-09-25 06:33:32.521519: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-09-25 06:33:33.481558: step 0, loss = 4.68 (453.2 examples/sec; 0.282 sec/batch)
2019-09-25 06:33:36.138869: step 10, loss = 4.59 (481.7 examples/sec; 0.266 sec/batch)
2019-09-25 06:33:38.594918: step 20, loss = 4.39 (521.2 examples/sec; 0.246 sec/batch)
2019-09-25 06:33:41.108821: step 30, loss = 4.49 (509.2 examples/sec; 0.251 sec/batch)
2019-09-25 06:33:43.613224: step 40, loss = 4.40 (511.1 examples/sec; 0.250 sec/batch)
2019-09-25 06:33:46.074761: step 50, loss = 4.36 (520.0 examples/sec; 0.246 sec/batch)
2019-09-25 06:33:48.577182: step 60, loss = 4.07 (511.5 examples/sec; 0.250 sec/batch)
2019-09-25 06:33:51.072065: step 70, loss = 4.19 (513.1 examples/sec; 0.249 sec/batch)
2019-09-25 06:33:53.582682: step 80, loss = 4.14 (509.8 examples/sec; 0.251 sec/batch)
2019-09-25 06:33:55.993107: step 90, loss = 4.14 (531.0 examples/sec; 0.241 sec/batch)
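
As I understand it, TensorFlow 1.x pre-allocates nearly all free GPU memory by default, so the almost-full memory bar in nvidia-smi doesn't by itself mean the GPU is busy. For my own scripts I could make nvidia-smi reflect actual usage by enabling on-demand growth; a minimal sketch, assuming the TF 1.x Session API:

import tensorflow as tf

# TF 1.x grabs nearly all free GPU memory up front by default, which is why
# nvidia-smi shows ~6 GB in use regardless of how hard the GPU is working.
# allow_growth makes TensorFlow allocate memory on demand instead.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

So the utilization and temperature numbers seem like the more telling signals here.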

I had some install issues the first time I tried this and had to reinstall everything to get it to see the GPU. From the output above and the nvidia-smi output, the test is clearly using the GPU. However, the author of the post claims over 6,000 examples/sec with a single 1070 card, which is roughly 12x what I see!

My questions are:

1) Am I getting the performance I should expect from this test based on my hardware (32GB RAM, 6GB VRAM)?

2) Should I expect this test to also max out all 24 threads of my Threadripper and push the CPU to 68°C, close to its max temp?

3) Is there a better validation test I can run?

I am hoping to make sure my system is set up correctly and, if not, to figure out how to track down any bottlenecks or misconfigurations.

PS - I also noticed something called TensorRT. How does that fit into the performance picture?

Thanks, RaJa, for responding. I ran the first test from the link you sent and, just like the TensorFlow load test above, it shows that TensorFlow recognizes and is using the GPU:

2019-09-25 07:55:24.868796: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-09-25 07:55:24.989039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7085
pciBusID: 0000:41:00.0
totalMemory: 5.93GiB freeMemory: 5.86GiB
2019-09-25 07:55:24.989066: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-09-25 07:55:25.302006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-25 07:55:25.302044: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0
2019-09-25 07:55:25.302050: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N
2019-09-25 07:55:25.302169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5640 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:41:00.0, compute capability: 6.1)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:41:00.0, compute capability: 6.1
2019-09-25 07:55:25.360593: I tensorflow/core/common_runtime/direct_session.cc:288] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:41:00.0, compute capability: 6.1

MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2019-09-25 07:55:25.361267: I tensorflow/core/common_runtime/placer.cc:935] MatMul: (MatMul)/job:localhost/replica:0/task:0/device:GPU:0
a: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-09-25 07:55:25.361289: I tensorflow/core/common_runtime/placer.cc:935] a: (Const)/job:localhost/replica:0/task:0/device:GPU:0
b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-09-25 07:55:25.361302: I tensorflow/core/common_runtime/placer.cc:935] b: (Const)/job:localhost/replica:0/task:0/device:GPU:0
[[22. 28.]
 [49. 64.]]
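
For reference, that output comes from what looks like the standard log_device_placement matmul check; a minimal sketch of it, assuming the TF 1.x Session API:

import tensorflow as tf

# Two small constant matrices multiplied on whatever device TensorFlow picks;
# log_device_placement prints where each op actually runs (GPU:0 above).
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c))  # expected: [[22. 28.] [49. 64.]]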

I'm less concerned about whether TensorFlow recognizes and uses the GPU than about whether it is maximizing its use of the GPU. See my three questions above. I'm most concerned that I'm only seeing about 1/12th of the performance the author of the TensorFlow load test reports with what I believe should be a comparable card (1060 vs. 1070). Am I missing something? Should I expect only this level of performance? What do other people see for performance using the same load test and a similar card? Would TensorRT help?

Thanks!

Upvotes: 2

Views: 865

Answers (1)

RaJa

Reputation: 1567

So I have found the reason, or at least a better check. The official cifar10 test appears to be buggy; there is already an open issue about it in the TensorFlow repository.

I have found a better, working example here: multi-gpu example

Just run the file with python3 multigpu_cnn.py and watch the number of samples per second. Then you can change the number of GPUs in the file and check again.
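
If you just want a quick sanity check that work actually spreads across your GPUs, that example boils down to explicit tf.device pinning; a minimal sketch of the pattern (illustrative only, not the code from the linked repo):

import tensorflow as tf

num_gpus = 2  # edit to match your hardware

# Pin one large matmul to each GPU, then sum the results on the CPU.
results = []
for i in range(num_gpus):
    with tf.device('/gpu:%d' % i):
        a = tf.random_normal([8000, 8000])
        results.append(tf.matmul(a, a))
with tf.device('/cpu:0'):
    total = tf.reduce_sum(tf.add_n(results))

# allow_soft_placement lets the same script fall back gracefully
# on a machine with fewer GPUs than num_gpus.
config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    for _ in range(50):
        sess.run(total)  # watch nvidia-smi while this runs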

My stats (4 Titan Z):

  • 2 GPUs -> 8800 samples/sec
  • 4 GPUs -> 15500 samples/sec

Checking with nvidia-smi shows 100% load on all GPUs for the last example.

Upvotes: 1
