JoelKuiper

Reputation: 4720

Using CUDA GPUs at prediction time for high-throughput streams

We're developing a Natural Language Processing application with a user-facing component. Users call our models through an API and get results back. The models are pretrained using Keras with Theano. We use GPUs to speed up training, and prediction is also significantly faster on the GPU. Currently we have a machine with two GPUs. However, at runtime (i.e. when serving the user-facing bits) there is a problem: multiple Python processes sharing the GPUs via CUDA do not seem to give a parallel speed-up. We're using nvidia-docker with libgpuarray (pygpu), Theano and Keras. The GPUs are mostly idle, yet adding more Python workers does not speed up the process.

What is the preferred way of solving the problem of running GPU models behind an API? Ideally we'd utilize the existing GPUs more efficiently before buying new ones.

I can imagine that we want some sort of buffer that batches requests before sending them to the GPU, rather than acquiring a lock for each HTTP call?
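The buffering idea above is commonly implemented as micro-batching: a single worker thread owns the GPU and collects individual requests into batches. A minimal sketch, assuming a `predict_batch` callable and batch-size/timeout values that are not from the question (in practice `predict_batch` would wrap something like a Keras `model.predict` on a stacked input array):

```python
import queue
import threading
import time
from concurrent.futures import Future


class MicroBatcher:
    """Collects single prediction requests into batches for one GPU worker.

    `predict_batch` is a hypothetical stand-in for a batched model call;
    `max_batch` and `max_wait` trade latency against GPU utilization.
    """

    def __init__(self, predict_batch, max_batch=32, max_wait=0.01):
        self.predict_batch = predict_batch
        self.max_batch = max_batch
        self.max_wait = max_wait
        self.q = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, item):
        """Called from each HTTP handler; blocks until the result is ready."""
        fut = Future()
        self.q.put((item, fut))
        return fut.result()

    def _loop(self):
        while True:
            # Block for the first request, then gather more until the
            # batch is full or the wait deadline passes.
            item, fut = self.q.get()
            batch, futs = [item], [fut]
            deadline = time.monotonic() + self.max_wait
            while len(batch) < self.max_batch:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    item, fut = self.q.get(timeout=timeout)
                except queue.Empty:
                    break
                batch.append(item)
                futs.append(fut)
            # One GPU call serves the whole batch; results are handed
            # back to the waiting HTTP handlers via their futures.
            for f, result in zip(futs, self.predict_batch(batch)):
                f.set_result(result)
```

With this design the per-request lock disappears: contention is limited to a thread-safe queue, and the GPU sees fewer, larger calls instead of many tiny ones.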

Upvotes: 0

Views: 222

Answers (1)

einpoklum

Reputation: 131646

This is not an answer to your more general question, but rather an answer based on how I understand the scenario you described.

If someone has coded a system which uses a GPU for some computational task, they have (hopefully) taken the time to parallelize its execution so as to benefit from the full resources the GPU can offer, or something close to that.

That means that if you add a second similar task - even in parallel - the total time to complete both should be similar to the time to complete them serially, i.e. one after the other - since there are very few underutilized GPU resources for the second task to benefit from. In fact, it could even be the case that both tasks run slower (if, say, they both make heavy use of the L2 cache, and when running together they thrash it).

At any rate, when you want to improve performance, a good thing to do is profile your application - in this case, using the nvprof profiler or its nvvp frontend (the first link is the official documentation, the second link is a presentation).

Upvotes: 1
