Reputation: 361
I found a tutorial for running TGI (Text Generation Inference) with its Docker image.
However, I'm having trouble using a GPU in a Docker container, so I was wondering if there is another way to stream the output of the model. I have tried TextStreamer, but it can only write the result to standard output. In my case, I want to send the streamed output to the frontend, similar to how ChatGPT works.
Upvotes: 1
Views: 743
Reputation: 361
I have found the answer: we can do this in transformers with TextIteratorStreamer, which exposes the generated text as an iterator.
from threading import Thread

from transformers import TextIteratorStreamer

# tokenizer, model, prompt_template, and stop_criteria are assumed to be
# defined earlier (e.g. via AutoTokenizer / AutoModelForCausalLM).
inputs = tokenizer(prompt_template, return_tensors="pt").input_ids.cuda()

# skip_prompt=True makes the streamer yield only newly generated text,
# not an echo of the prompt.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

generation_kwargs = {
    "inputs": inputs,
    "streamer": streamer,
    "max_new_tokens": 512,
    "stopping_criteria": stop_criteria,
    "do_sample": True,  # temperature only takes effect when sampling
    "temperature": 0.7,
}

# generate() blocks, so run it in a background thread; the streamer is
# an iterator that yields decoded text chunks as they are produced.
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

for new_text in streamer:
    yield new_text
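Since the goal is to push the stream to a browser rather than stdout, the generator above can be wrapped in a streaming HTTP response. Below is a minimal sketch using FastAPI; the framework choice, route name, and token limit are my assumptions, not part of the original answer:
from threading import Thread

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from transformers import TextIteratorStreamer

app = FastAPI()

def generate_stream(prompt: str):
    # model and tokenizer are assumed to be loaded once at startup.
    inputs = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
    Thread(
        target=model.generate,
        kwargs={"inputs": inputs, "streamer": streamer, "max_new_tokens": 512},
    ).start()
    for new_text in streamer:
        yield new_text

@app.get("/chat")
def chat(prompt: str):
    # StreamingResponse forwards each chunk as soon as it arrives, giving
    # the incremental, ChatGPT-style output on the frontend.
    return StreamingResponse(generate_stream(prompt), media_type="text/plain")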
Upvotes: 1
Reputation: 512
You should probably proceed with TGI.
To use a GPU within a Docker container, install the NVIDIA Container Toolkit and then run:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker  # restart the daemon so the nvidia runtime is picked up
docker run --runtime=nvidia --gpus all -it <YOUR_IMAGE_TAG>
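TGI then exposes token streaming out of the box. A minimal sketch of consuming it from Python with huggingface_hub, assuming the container maps its port to localhost:8080 (the address and prompt are illustrative):
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# With stream=True, text_generation yields generated text chunk by chunk,
# which can be forwarded to the frontend the same way as above.
for chunk in client.text_generation("Tell me a joke.", max_new_tokens=64, stream=True):
    print(chunk, end="", flush=True)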
Upvotes: -1