Reputation: 8051
I compute vector embeddings for text paragraphs using the all-MiniLM-L6-v2 model at HuggingFace. Since the free endpoint wasn't always responsive enough and I need to be able to scale, I deployed the model to HuggingFace Inference Endpoints. To begin with, I chose the cheapest endpoint.
To my surprise, a single request to compute 35 embeddings took more than 7 seconds (according to the log at HuggingFace). Following a suggestion from HuggingFace support, I upgraded to 2 CPUs and it got even slower (truth be told, I am not sure why they thought a single request would benefit from another CPU). Next, I tried a GPU endpoint; the request now takes 2 seconds.
I must be missing something, because it seems impossible that one would pay >$400/month to serve a single request in 2 seconds, rather than serving thousands of requests per second. Still, I don't see what that something could be.
I submit the requests using the command in the following format:
curl https://xxxxxxxxxxxxxx.us-east-1.aws.endpoints.huggingface.cloud -X POST -d '{"inputs": ["My paragraphs are of about 200 words on average", "Another paragraph", etc.]}' -H 'Authorization: Bearer xxxxxxxxxxxxxxxxxxxxxxxxxx' -H 'Content-Type: application/json'
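For reference, here is roughly the same request issued from Python and timed on the client side (just a sketch; the endpoint URL and token are placeholders, and I am assuming the response body is a JSON list of embeddings):
import time
import requests

ENDPOINT_URL = "https://xxxxxxxxxxxxxx.us-east-1.aws.endpoints.huggingface.cloud"  # placeholder
HEADERS = {
    "Authorization": "Bearer xxxxxxxxxxxxxxxxxxxxxxxxxx",  # placeholder token
    "Content-Type": "application/json",
}
# a batch of 35 short test inputs (my real paragraphs are ~200 words each)
payload = {"inputs": ["My paragraphs are of about 200 words on average"] * 35}

start = time.time()
response = requests.post(ENDPOINT_URL, headers=HEADERS, json=payload)
response.raise_for_status()
embeddings = response.json()
print(len(embeddings), "embeddings in", round(time.time() - start, 2), "seconds")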
What could I be missing?
P.S. With the GPU, it does get much better once warmed up, reaching about 100 ms per request. However, this particular model achieves 14,200 embeddings per second on an A100. Granted, I wasn't running it on an A100, but 350 embeddings per second is still way too slow.
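For completeness, this is roughly how I measure embeddings per second locally (a sketch; the batch size, paragraph text, and count are arbitrary assumptions, since I don't know how the published A100 figure was batched):
import time
from sentence_transformers import SentenceTransformer

# loads onto the GPU automatically if one is available
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# synthetic ~200-word paragraphs, comparable in length to my real texts
paragraphs = ["My paragraphs are of about 200 words on average. " * 25] * 1000

model.encode(paragraphs[:64])  # warm-up pass

start = time.time()
model.encode(paragraphs, batch_size=64)
elapsed = time.time() - start
print(round(len(paragraphs) / elapsed), "embeddings per second")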
Upvotes: 2
Views: 3743
Reputation: 1435
To test performance on a single CPU core, I used:
from sentence_transformers import SentenceTransformer
import time

# encode the same batch of 10 sentences 100 times and time each run
sentences = ["This is an example sentence each sentence is converted"] * 10
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device='cpu')

for i in range(100):
    start_time = time.time()
    embeddings = model.encode(sentences)
    end = time.time()
    print("Time taken: ", end - start_time)
And ran it with:
taskset -c 0 python soquestion.py
which encoded the batch of 10 sentences in this many seconds per run:
...
Time taken: 0.035448551177978516
Time taken: 0.035162925720214844
Time taken: 0.03574204444885254
Time taken: 0.035799264907836914
Time taken: 0.03513455390930176
Time taken: 0.03690838813781738
Time taken: 0.035082340240478516
Time taken: 0.035216331481933594
Time taken: 0.0348513126373291
...
But if I use all of my cores:
...
Time taken: 0.016519546508789062
Time taken: 0.01624751091003418
Time taken: 0.017212390899658203
Time taken: 0.016582727432250977
Time taken: 0.019397735595703125
Time taken: 0.016611814498901367
Time taken: 0.017941713333129883
Time taken: 0.01743769645690918
...
So I would say core count affects speed. I'm using an AMD Ryzen 5 5000, which might or might not be significantly slower than the Intel Xeon Ice Lake CPUs Hugging Face provides (they don't really tell you the exact model, and performance varies a lot between CPUs).
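One way to see the same effect without taskset is to cap PyTorch's intra-op thread count directly (a sketch; torch.set_num_threads controls how many CPU threads the underlying ops may use):
import time
import torch
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence each sentence is converted"] * 10
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device='cpu')
model.encode(sentences)  # warm-up

max_threads = torch.get_num_threads()  # defaults to the number of cores
for n_threads in (1, max_threads):
    torch.set_num_threads(n_threads)
    start = time.time()
    model.encode(sentences)
    print(n_threads, "thread(s):", time.time() - start, "seconds")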
However, I can say that your instances may be insufficient memory-wise, because the pricing docs state:
Provider  Size    Price/hour  vCPUs  Memory  CPU
aws       small   $0.06       1      2 GB    Intel Xeon - Ice Lake
aws       medium  $0.12       2      4 GB    Intel Xeon - Ice Lake
aws       large   $0.24       4      8 GB    Intel Xeon - Ice Lake
aws       xlarge  $0.48       8      16 GB   Intel Xeon - Ice Lake
azure     small   $0.06       1      2 GB    Intel Xeon
azure     medium  $0.12       2      4 GB    Intel Xeon
azure     large   $0.24       4      8 GB    Intel Xeon
azure     xlarge  $0.48       8      16 GB   Intel Xeon
And you mentioned using 1 to 2 vCPUs, which come with 2-4 GB of RAM. I checked how much RAM this process uses with:
/usr/bin/time -v python soquestion.py |& grep resident
Maximum resident set size (kbytes): 981724
Average resident set size (kbytes): 0
Which is about 1 GB: a lot compared to what the CPU instances offer, very little compared to GPU instances. I would suggest considering an upgrade of your instance, although I came across this question which is struggling even with 4 GB of RAM.
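If you prefer to check the peak memory from inside the script rather than via /usr/bin/time, resource.getrusage reports the same figure (a sketch; on Linux ru_maxrss is given in kilobytes):
import resource
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device='cpu')
model.encode(["This is an example sentence each sentence is converted"] * 10)

# ru_maxrss is in kilobytes on Linux (bytes on macOS)
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("Maximum resident set size (kbytes):", peak_kb)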
Upvotes: 2