Reputation: 11
I am testing different AI models on runpod.io. One of them is dolphin-mixtral:8x22b. I followed Runpod's tutorial for setting up the pod with Ollama (https://docs.runpod.io/tutorials/pods/run-ollama) and used an H100 SXM pod with 80 GB VRAM, 16 vCPUs, and 125 GB RAM.
However, when I start the model and ask it something like "hey," it uses 100% of the CPU and 0% of the GPU, and the response takes 5-10 minutes.
How can I make Ollama use my GPU?
I have already tried different server settings.
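For reference, this is roughly how I am checking utilization inside the pod while the model answers (my exact commands may differ):
:/# nvidia-smi     # GPU utilization and memory stay near 0 while the model answers
:/# top            # the ollama process sits at ~100% CPU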
Upvotes: 0
Views: 1520
Reputation: 1697
I just came across the same problem and ended up running smaller models (fewer parameters). What happens is that the VRAM is fully occupied by this large model, so parts of the model are forcibly offloaded to RAM and run on the CPU.
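A quick way to see the split (a sketch, assuming a recent Ollama build and that a smaller tag such as dolphin-mixtral:8x7b is available in the registry):
:/# ollama ps                          # PROCESSOR column shows the CPU/GPU split of the loaded model
:/# ollama pull dolphin-mixtral:8x7b   # smaller variant that should fit entirely in 80 GB VRAM
:/# ollama run dolphin-mixtral:8x7b
:/# ollama ps                          # should now report 100% GPU
If ollama ps still shows a CPU share, the model is being partially offloaded and inference will stay slow.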
Upvotes: 0
Reputation: 21
I ran into the same issue and found the answer in this Reddit post. Set the CUDA_VISIBLE_DEVICES environment variable to 0,1 before running ollama serve:
:/# export CUDA_VISIBLE_DEVICES=0,1
:/# echo $CUDA_VISIBLE_DEVICES
0,1
:/# ollama serve
Tested on an A4000 pod.
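As a sanity check (a sketch, assuming nvidia-smi is available in the pod image), the GPU should show allocated memory and non-zero utilization while a prompt is being answered:
:/# ollama run dolphin-mixtral:8x22b "hey"   # in one terminal
:/# nvidia-smi                               # in another terminal: memory.used and utilization.gpu should be non-zero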
Upvotes: 1