Reputation: 1415
I wrote the following piece of code to evaluate the effect of Python multiprocessing while using TensorFlow:
import tensorflow as tf
from multiprocessing import Process

mydevice = "/gpu:0"
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.01)
mrange = 1000

def myfun():
    with tf.device(mydevice):
        mm1 = tf.constant([[float(i) for i in range(mrange)]], dtype='float32')
        mm2 = tf.constant([[float(i)] for i in range(mrange)], dtype='float32')

    with tf.device(mydevice):
        prod = tf.matmul(mm1, mm2)

    sess = tf.Session(config=tf.ConfigProto(log_device_placement=True, gpu_options=gpu_options))
    rest = sess.run(prod)
    print rest
    sess.close()

ll = []
for i in range(100):
    p1 = Process(target=myfun)
    p1.start()
    ll.append(p1)

for item in ll:
    item.join()
Time taken to run this code on my laptop's GPU: ~6 seconds
If I change the device to CPU: ~6 seconds
If I remove multiprocessing and call the function serially: 75 seconds
Could someone please explain what exactly happens if I use multiprocessing while the device is set to GPU? It is clear that multiple CUDA kernels will be launched, but will they be running concurrently on the GPU?
This is just an experiment to see if I can launch multiple RNNs onto the GPU.
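For reference, the fan-out/join pattern above can be sketched without TensorFlow at all; here the graph is replaced by the same row-times-column product computed in plain Python, so the snippet runs anywhere (the helper names `dot_worker` and `run_parallel` are illustrative, not from the original post):

```python
from multiprocessing import Process, Queue

MRANGE = 1000  # same size as mrange in the original snippet

def dot_worker(q):
    # Stand-in for the TF graph: mm1 (1 x N) times mm2 (N x 1) is just
    # the dot product of [0, 1, ..., N-1] with itself.
    q.put(sum(float(i) * float(i) for i in range(MRANGE)))

def run_parallel(n_procs=4):
    # Same start-all-then-join-all shape as the original harness.
    q = Queue()
    procs = [Process(target=dot_worker, args=(q,)) for _ in range(n_procs)]
    for p in procs:
        p.start()
    # Drain the queue before joining so workers never block on a full pipe.
    results = [q.get() for _ in procs]
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    print(run_parallel())
```

Each process here does independent CPU work, so the speedup over a serial loop comes purely from running on multiple cores, with no shared state between workers.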
Upvotes: 0
Views: 5274
Reputation: 1733
Fermi and later GPUs support concurrent kernel execution via CUDA streams, which TensorFlow uses. Therefore, independent ops will run in parallel even if they are in the same graph, launched by a single sess.run call on a single thread, as long as the CUDA runtime thinks it is beneficial to do so.
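A minimal single-process sketch of that point, assuming a TensorFlow 1.x-style session API is available (either TF 1.x itself or tf.compat.v1 on newer installs): two matmuls with no data dependency sit in one graph and are fetched by one sess.run call, leaving TensorFlow free to overlap them on CUDA streams.

```python
# Sketch only: assumes TensorFlow with the 1.x session API.
try:
    import tensorflow.compat.v1 as tf  # TF 2.x compatibility shim
    tf.disable_eager_execution()
except ImportError:
    import tensorflow as tf  # plain TF 1.x

with tf.device("/gpu:0"):
    # No data dependency between a and b, so TensorFlow may dispatch
    # their kernels to the GPU concurrently via CUDA streams.
    a = tf.matmul(tf.random_normal([512, 512]), tf.random_normal([512, 512]))
    b = tf.matmul(tf.random_normal([512, 512]), tf.random_normal([512, 512]))

config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
with tf.Session(config=config) as sess:
    ra, rb = sess.run([a, b])  # one call, one thread; kernels can still overlap
    print(ra.shape, rb.shape)
```

allow_soft_placement=True lets the same sketch fall back to the CPU on machines without a GPU; the scheduling behavior described above applies only when the ops actually land on the GPU.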
Upvotes: 0
Reputation: 12185
GPUs are mainly designed to render 2D and 3D computer graphics. This involves a lot of number crunching which can benefit from parallel algorithms. Deep learning also involves a lot of parallel number crunching, so the same hardware that accelerates graphics can also accelerate deep learning.
What makes a GPU different from a CPU is that it is optimized for highly parallel number crunching. Look at the specs for any Nvidia GPU and you will see a metric called CUDA Cores. This number is usually somewhere in the thousands (or hundreds for weaker GPUs). A single CUDA core is a lot weaker than a standard CPU core, but since you have so many, a GPU can greatly outperform a CPU on parallel tasks. The architecture is actually pretty complex, which you can read about if you get into CUDA programming. Take a look at this article: https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units
From the numbers you posted I am guessing you have a weak laptop GPU, which is why it performs about the same as the CPU. On my desktop I have the new GTX 1080 and it can beat my CPU by more than 20x. I am surprised that your numbers go up so much when you call the function serially, but I think there is something else going on there, since I am not even sure how you would do that with TensorFlow.
Upvotes: 3