JoelKuiper

Reputation: 4720

Using CUDA GPUs at prediction time for high-throughput streams

We're developing a Natural Language Processing application with a user-facing component. Users call our models through an API and get results back. The models are pretrained using Keras with Theano. We use GPUs to speed up training, and prediction is also significantly faster on the GPU. Currently we have a machine with two GPUs. However, at runtime (i.e. when serving the user-facing bits) there is a problem: multiple Python processes sharing the GPUs via CUDA do not seem to give a parallel speed-up. We're using nvidia-docker with libgpuarray (pygpu), Theano and Keras. The GPUs are mostly idle, yet adding more Python workers does not speed up the process.

What is the preferred way of solving the problem of running GPU models behind an API? Ideally we'd utilize the existing GPUs more efficiently before buying new ones.

I can imagine that we want some sort of buffer that batches requests before sending them to the GPU, rather than acquiring a lock for each HTTP call?
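The buffering idea above is commonly implemented as micro-batching: a single worker thread owns the GPU and collects individual requests into batches. A minimal sketch, assuming a `predict_batch` callable and batch-size/timeout values that are not from the question (in practice `predict_batch` would wrap something like a Keras `model.predict` on a stacked input array):

```python
import queue
import threading
import time
from concurrent.futures import Future


class MicroBatcher:
    """Collects single prediction requests into batches for one GPU worker.

    `predict_batch` is a hypothetical stand-in for a batched model call;
    `max_batch` and `max_wait` trade latency against GPU utilization.
    """

    def __init__(self, predict_batch, max_batch=32, max_wait=0.01):
        self.predict_batch = predict_batch
        self.max_batch = max_batch
        self.max_wait = max_wait
        self.q = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, item):
        """Called from each HTTP handler; blocks until the result is ready."""
        fut = Future()
        self.q.put((item, fut))
        return fut.result()

    def _loop(self):
        while True:
            # Block for the first request, then gather more until the
            # batch is full or the wait deadline passes.
            item, fut = self.q.get()
            batch, futs = [item], [fut]
            deadline = time.monotonic() + self.max_wait
            while len(batch) < self.max_batch:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    item, fut = self.q.get(timeout=timeout)
                except queue.Empty:
                    break
                batch.append(item)
                futs.append(fut)
            # One GPU call serves the whole batch; results are handed
            # back to the waiting HTTP handlers via their futures.
            for f, result in zip(futs, self.predict_batch(batch)):
                f.set_result(result)
```

With this design the per-request lock disappears: contention is limited to a thread-safe queue, and the GPU sees fewer, larger calls instead of many tiny ones.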

Upvotes: 0

Views: 222

Answers (1)

einpoklum

Reputation: 131646

This is not an answer to your more general question, but rather an answer based on how I understand the scenario you described.

If someone has coded a system which uses a GPU for some computational task, they have (hopefully) taken the time to parallelize its execution so as to benefit from the full resources the GPU can offer, or something close to that.

That means that if you add a second similar task - even in parallel - the total time to complete both should be similar to the time to complete them serially, i.e. one after the other - since there are very few underutilized GPU resources for the second task to benefit from. In fact, it could even be the case that both tasks run slower (if, say, they both make heavy use of the L2 cache, and when running together they thrash it).

At any rate, when you want to improve performance, a good thing to do is profile your application - in this case, using the nvprof profiler or its nvvp frontend (the first link is the official documentation, the second link is a presentation).

Upvotes: 1
