chiragsanghvi

Reputation: 21

Gunicorn worker, threads for GPU tasks to increase concurrency/parallelism

I'm using Flask with Gunicorn to implement an AI server. The server takes in HTTP requests and calls the algorithm (built with PyTorch); the computation runs on the NVIDIA GPU.

I need some input on how to achieve concurrency/parallelism in this case. The machine has 8 vCPUs, 20 GB of memory, and one GPU with 12 GB of memory.

The specific questions are:

  1. How can I increase concurrency/parallelism?
  2. Do I have to specify the number of threads for computation on the GPU?

My current Gunicorn command is:

gunicorn --bind 0.0.0.0:8002 main:app --timeout 360 --workers=5 --worker-class=gevent --worker-connections=1000
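For comparison, a thread-based variant of the same command would look like the sketch below. This is an assumption on my part, not something verified on this setup: gevent mainly helps with I/O-bound waits, so for CPU/GPU-bound inference the gthread worker class, sized to the vCPU count and GPU memory, is often suggested instead. The worker and thread counts here are hypothetical placeholders.

```shell
# Sketch: thread-based workers instead of gevent (hypothetical values;
# tune --workers and --threads to your own vCPU count and GPU memory).
gunicorn --bind 0.0.0.0:8002 main:app \
    --timeout 360 \
    --workers=2 \
    --worker-class=gthread \
    --threads=4
```

Each worker is a separate process with its own copy of the model, so with a 12 GB GPU the worker count is usually bounded by model memory before it is bounded by CPU.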

Upvotes: 1

Views: 1898

Answers (1)

chiragsanghvi

Reputation: 21

Fast tokenizers are apparently not thread-safe.

AutoTokenizer is a wrapper that uses either the fast or the slow implementation internally, and its default is the fast one (not thread-safe). You have to switch to the slow (thread-safe) implementation, which is why the use_fast=False flag is added.

I was able to solve this with:

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
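If you want to keep the fast tokenizer, two common workarounds for sharing a non-thread-safe object across request threads are a lock around a single shared instance, or one instance per thread. This is a sketch, not verified against transformers; `DummyTokenizer` is a hypothetical stand-in for the object returned by `AutoTokenizer.from_pretrained(...)`.

```python
import threading

class DummyTokenizer:
    """Hypothetical stand-in for a non-thread-safe fast tokenizer."""
    def __call__(self, text):
        return text.split()

# Option A: guard one shared tokenizer with a lock (serializes tokenization).
tokenizer = DummyTokenizer()
tokenizer_lock = threading.Lock()

def tokenize_locked(text):
    with tokenizer_lock:
        return tokenizer(text)

# Option B: give each worker thread its own tokenizer instance.
_local = threading.local()

def tokenize_per_thread(text):
    if not hasattr(_local, "tokenizer"):
        _local.tokenizer = DummyTokenizer()  # created once per thread
    return _local.tokenizer(text)

# Exercise the locked variant from several threads.
results = []
threads = [
    threading.Thread(target=lambda: results.append(tokenize_locked("hello gpu world")))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Option A trades throughput for simplicity; Option B avoids contention at the cost of one tokenizer per thread, which is usually cheap compared to the model itself.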

Best, Chirag Sanghvi

Upvotes: 0
