Reputation: 21
I'm using Flask with Gunicorn to implement an AI server. The server takes in HTTP requests and calls the algorithm (built with PyTorch); the computation runs on an NVIDIA GPU.
I need some input on how I can achieve concurrency/parallelism in this case. The machine has 8 vCPUs, 20 GB of RAM, and one GPU with 12 GB of memory.
Specifically, my current Gunicorn command is:
gunicorn --bind 0.0.0.0:8002 main:app --timeout 360 --workers=5 --worker-class=gevent --worker-connections=1000
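For context, a minimal sketch of what main:app is assumed to look like here (the model file, route, and input format are illustrative assumptions, not taken from the question):

# main.py -- illustrative sketch only
import torch
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the model once at import time, so each Gunicorn worker holds its own copy on the GPU
model = torch.jit.load("model.pt").eval().cuda()  # hypothetical model file

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()
    x = torch.tensor(data["inputs"]).cuda()
    with torch.no_grad():  # inference only, no gradient tracking
        y = model(x)
    return jsonify({"outputs": y.cpu().tolist()})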
Upvotes: 1
Views: 1898
Reputation: 21
Fast tokenizers are apparently not thread-safe.
AutoTokenizer is a wrapper that uses either the fast (Rust-based) or the slow (pure-Python) tokenizer internally. Its default is the fast one, which is not thread-safe, so you have to switch to the slow one, which is safe; that is why you add the use_fast=False flag.
I was able to solve this with:
from transformers import AutoTokenizer

# use_fast=False selects the slow, pure-Python tokenizer, which is thread-safe
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
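As a quick sanity check, here is a minimal sketch that shares one slow tokenizer across threads (model_name, the worker count, and the sample text are assumptions):

from concurrent.futures import ThreadPoolExecutor
from transformers import AutoTokenizer

model_name = "bert-base-uncased"  # assumption: substitute your own model
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

def encode(text):
    # Every thread calls the same shared tokenizer; the slow implementation is safe here
    return tokenizer(text, return_tensors="pt")

with ThreadPoolExecutor(max_workers=8) as pool:
    encodings = list(pool.map(encode, ["sample request text"] * 32))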
Best, Chirag Sanghvi
Upvotes: 0