Reputation: 8051
I compute vector embeddings for text paragraphs using the all-MiniLM-L6-v2 model at HuggingFace. Since the free endpoint wasn't always responsive enough and I need to be able to scale, I deployed the model to HuggingFace Inference Endpoints. To begin with, I chose the cheapest endpoint.
To my surprise, a single request to compute 35 embeddings took more than 7 seconds (according to the log at HuggingFace). Following a suggestion from HuggingFace support, I upgraded to 2 CPUs and it got even slower (truth be told, I am not sure why they thought a single request would benefit from another CPU). Next, I tried a GPU endpoint; the request now takes 2 seconds.
I must be missing something, because it seems impossible that one would pay >$400/month to serve a single request in 2 seconds, rather than serving thousands of requests per second. Still, I don't see what that something could be.
I submit the requests using the command in the following format:
curl https://xxxxxxxxxxxxxx.us-east-1.aws.endpoints.huggingface.cloud -X POST -d '{"inputs": ["My paragraphs are of about 200 words on average", "Another paragraph", etc.]}' -H 'Authorization: Bearer xxxxxxxxxxxxxxxxxxxxxxxxxx' -H 'Content-Type: application/json'
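For reference, here is roughly the same request issued from Python and timed on the client side (just a sketch; the endpoint URL and token are placeholders, and I am assuming the response body is a JSON list of embeddings):
import time
import requests

ENDPOINT_URL = "https://xxxxxxxxxxxxxx.us-east-1.aws.endpoints.huggingface.cloud"  # placeholder
HEADERS = {
    "Authorization": "Bearer xxxxxxxxxxxxxxxxxxxxxxxxxx",  # placeholder token
    "Content-Type": "application/json",
}
# a batch of 35 short test inputs (my real paragraphs are ~200 words each)
payload = {"inputs": ["My paragraphs are of about 200 words on average"] * 35}

start = time.time()
response = requests.post(ENDPOINT_URL, headers=HEADERS, json=payload)
response.raise_for_status()
embeddings = response.json()
print(len(embeddings), "embeddings in", round(time.time() - start, 2), "seconds")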
What could I be missing?
P.S. With the GPU, it does get much better once warmed up, reaching about 100 ms per request. However, this particular model achieves 14,200 embeddings per second on an A100. Granted, I wasn't running it on an A100, but 350 embeddings per second is still way too slow.
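For completeness, this is roughly how I measure embeddings per second locally (a sketch; the batch size, paragraph text, and count are arbitrary assumptions, since I don't know how the published A100 figure was batched):
import time
from sentence_transformers import SentenceTransformer

# loads onto the GPU automatically if one is available
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# synthetic ~200-word paragraphs, comparable in length to my real texts
paragraphs = ["My paragraphs are of about 200 words on average. " * 25] * 1000

model.encode(paragraphs[:64])  # warm-up pass

start = time.time()
model.encode(paragraphs, batch_size=64)
elapsed = time.time() - start
print(round(len(paragraphs) / elapsed), "embeddings per second")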
Upvotes: 2
Views: 3743
Reputation: 1435
To test performance on a single CPU core, I used:
from sentence_transformers import SentenceTransformer
import time

# encode the same batch of 10 sentences 100 times and time each run
sentences = ["This is an example sentence each sentence is converted"] * 10
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device='cpu')

for i in range(100):
    start_time = time.time()
    embeddings = model.encode(sentences)
    end = time.time()
    print("Time taken: ", end - start_time)
And ran it with:
taskset -c 0 python soquestion.py
which encoded the batch of 10 sentences in this many seconds per run:
...
Time taken: 0.035448551177978516
Time taken: 0.035162925720214844
Time taken: 0.03574204444885254
Time taken: 0.035799264907836914
Time taken: 0.03513455390930176
Time taken: 0.03690838813781738
Time taken: 0.035082340240478516
Time taken: 0.035216331481933594
Time taken: 0.0348513126373291
...
But if I use all of my cores:
...
Time taken: 0.016519546508789062
Time taken: 0.01624751091003418
Time taken: 0.017212390899658203
Time taken: 0.016582727432250977
Time taken: 0.019397735595703125
Time taken: 0.016611814498901367
Time taken: 0.017941713333129883
Time taken: 0.01743769645690918
...
So I would say core count affects speed. I'm using an AMD Ryzen 5 5000, which might or might not be significantly slower than the Intel Xeon Ice Lake CPUs Hugging Face provides (they don't really tell you the exact model, and performance varies a lot between CPUs).
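One way to see the same effect without taskset is to cap PyTorch's intra-op thread count directly (a sketch; torch.set_num_threads controls how many CPU threads the underlying ops may use):
import time
import torch
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence each sentence is converted"] * 10
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device='cpu')
model.encode(sentences)  # warm-up

max_threads = torch.get_num_threads()  # defaults to the number of cores
for n_threads in (1, max_threads):
    torch.set_num_threads(n_threads)
    start = time.time()
    model.encode(sentences)
    print(n_threads, "thread(s):", time.time() - start, "seconds")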
However, I can say that your instances may be insufficient memory-wise, because the pricing docs state:
Provider  Size    Price/hour  vCPUs  Memory  CPU
aws       small   $0.06       1      2 GB    Intel Xeon - Ice Lake
aws       medium  $0.12       2      4 GB    Intel Xeon - Ice Lake
aws       large   $0.24       4      8 GB    Intel Xeon - Ice Lake
aws       xlarge  $0.48       8      16 GB   Intel Xeon - Ice Lake
azure     small   $0.06       1      2 GB    Intel Xeon
azure     medium  $0.12       2      4 GB    Intel Xeon
azure     large   $0.24       4      8 GB    Intel Xeon
azure     xlarge  $0.48       8      16 GB   Intel Xeon
And you mentioned using 1 to 2 vCPUs, which come with 2-4 GB of RAM. I checked how much RAM this process uses with:
/usr/bin/time -v python soquestion.py |& grep resident
Maximum resident set size (kbytes): 981724
Average resident set size (kbytes): 0
Which is about 1 GB: a lot compared to what the CPU instances offer, very little compared to GPU instances. I would suggest considering an upgrade of your instance, although I came across this question which is struggling even with 4 GB of RAM.
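If you prefer to check the peak memory from inside the script rather than via /usr/bin/time, resource.getrusage reports the same figure (a sketch; on Linux ru_maxrss is given in kilobytes):
import resource
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device='cpu')
model.encode(["This is an example sentence each sentence is converted"] * 10)

# ru_maxrss is in kilobytes on Linux (bytes on macOS)
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("Maximum resident set size (kbytes):", peak_kb)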
Upvotes: 2