Reputation: 21
I'm using Flask with Gunicorn to implement an AI server. The server takes in HTTP requests and calls the algorithm (built with PyTorch); the computation runs on an NVIDIA GPU.
I need some input on how I can achieve concurrency/parallelism in this case. The machine has 8 vCPUs, 20 GB of RAM, and one GPU with 12 GB of memory.
Specifically, my current Gunicorn command is:
gunicorn --bind 0.0.0.0:8002 main:app --timeout 360 --workers=5 --worker-class=gevent --worker-connections=1000
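For context, a minimal sketch of what main:app is assumed to look like here (the model file, route, and input format are illustrative assumptions, not taken from the question):

# main.py -- illustrative sketch only
import torch
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the model once at import time, so each Gunicorn worker holds its own copy on the GPU
model = torch.jit.load("model.pt").eval().cuda()  # hypothetical model file

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()
    x = torch.tensor(data["inputs"]).cuda()
    with torch.no_grad():  # inference only, no gradient tracking
        y = model(x)
    return jsonify({"outputs": y.cpu().tolist()})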
Upvotes: 1
Views: 1898
Reputation: 21
Fast tokenizers are apparently not thread-safe.
AutoTokenizer is a wrapper that uses either the fast (Rust-based) or the slow (pure-Python) tokenizer internally. Its default is the fast one, which is not thread-safe, so you have to switch to the slow one, which is safe; that is why you add the use_fast=False flag.
I was able to solve this with:
from transformers import AutoTokenizer

# use_fast=False selects the slow, pure-Python tokenizer, which is thread-safe
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
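As a quick sanity check, here is a minimal sketch that shares one slow tokenizer across threads (model_name, the worker count, and the sample text are assumptions):

from concurrent.futures import ThreadPoolExecutor
from transformers import AutoTokenizer

model_name = "bert-base-uncased"  # assumption: substitute your own model
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

def encode(text):
    # Every thread calls the same shared tokenizer; the slow implementation is safe here
    return tokenizer(text, return_tensors="pt")

with ThreadPoolExecutor(max_workers=8) as pool:
    encodings = list(pool.map(encode, ["sample request text"] * 32))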
Best, Chirag Sanghvi
Upvotes: 0