Mahmud Arfan

Reputation: 43

LLM is not loading onto the GPU even with BLAS = 1 (LlamaCpp, LangChain, Mistral 7B GGUF model)

Confession: I am not an expert in this area at all; I am just practicing and trying to learn as I work. I am also unsure whether this kind of model can even run on this type of GPU.

I am trying to run a model locally on my laptop (for now, this is the only machine I have). I downloaded the model from TheBloke on Hugging Face.

Intention: I am using LangChain, where I will upload some data and have a conversation with the model (roughly, that is the idea; unfortunately, I cannot say more because of privacy).

What Worked So Far: At first I used the llama-cpp-python (CPU) library and attempted to run the model, and it worked. But, as expected, the inference was so slow that it took nearly 2 minutes to answer one question.

Then I tried to rebuild llama-cpp-python with cuBLAS using the command below:

!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade llama-cpp-python

It worked, and after running the program, I noticed BLAS = 1 (previously, in the CPU-only version, it was BLAS = 0).

Problem: After running the entire program, I noticed that while I was uploading the data I wanted to converse with, the model was not being loaded onto my GPU. I caught this by looking at NVIDIA X Server, which showed that my GPU memory was not being consumed at all, even though the terminal was showing BLAS = 1. So apparently BLAS = 1 does not by itself indicate that the model is loaded onto the GPU. Now I am not sure what to do; I searched the internet but did not find a proper fix.

Some Additional Problems: I tried setting n_batch = 256 instead of the default value of 512 to reduce the strain on my GPU, but then I got the error ValueError: Requested tokens exceeded context window... So I am wondering how to handle the tradeoff between GPU layers, context window, and batch size. The LlamaCpp GPU documentation describes these parameters: [screenshot of the LlamaCpp GPU documentation]
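For reference, here is a minimal sketch of how I currently read these three parameters (the values are illustrative placeholders, not my working configuration, and my understanding may well be wrong):

from langchain.llms import LlamaCpp

# Illustrative values only (not my real settings).
# n_ctx bounds the total tokens per request (retrieved chunks + question + answer);
# the ValueError above seems to appear when a request needs more tokens than n_ctx.
# n_batch is the number of prompt tokens processed per batch and, per the docs,
# should stay between 1 and n_ctx.
# n_gpu_layers is the number of model layers offloaded to VRAM; the rest run on the CPU.
llm = LlamaCpp(
    model_path="/my/model/path/directory/sub_directory/mistral_7b_v_1/mistral-7b-v0.1.Q2_K.gguf",
    n_ctx=2048,
    n_batch=256,
    n_gpu_layers=20,
    verbose=True,
)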

The Code Snippet of My Project Where I Actually Configure the Model:

language_model = LlamaCpp(
    model_path="/my/model/path/directory/sub_directory/mistral_7b_v_1/mistral-7b-v0.1.Q2_K.gguf",
    n_gpu_layers=1,   # number of layers offloaded to the GPU
    n_batch=64,       # prompt tokens processed per batch
    n_ctx=256,        # context window (prompt + generated tokens)
    f16_kv=True,      # keep the key/value cache in half precision
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    verbose=True
)
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)
return ConversationalRetrievalChain.from_llm(
    llm=language_model,
    retriever=vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 5}),
    memory=memory
)

Hardware Details:

  1. GPU: NVIDIA GeForce RTX 3050 Laptop GPU / AMD Renoir
  2. GPU VRAM: 4 GB (3.8 GB usable)
  3. CPU: AMD Ryzen 9 5900HX with Radeon Graphics (16 threads)
  4. Machine RAM: 16 GB
  5. Model Max RAM Required: 5.58 GB (is this the main reason it is not running?)

Lastly: Thank you for reading this long post. I look forward to some answers. :)

Upvotes: 3

Views: 2389

Answers (1)

Raj Singh Parihar

Reputation: 11

Change the n_gpu_layers parameter: increase it slowly until your GPU runs out of memory.

Setting n_gpu_layers to -1 offloads all layers to the GPU.
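For example, here is a rough sketch using the same LlamaCpp wrapper as in your question (the layer count is just a starting guess for a 4 GB card, not a recommendation):

from langchain.llms import LlamaCpp

# Start with a modest number of layers and raise it until the GPU runs out of memory.
# n_gpu_layers=-1 tries to offload every layer, which may not fit a 7B model in 4 GB of VRAM.
language_model = LlamaCpp(
    model_path="/my/model/path/directory/sub_directory/mistral_7b_v_1/mistral-7b-v0.1.Q2_K.gguf",
    n_gpu_layers=20,   # tune this value; -1 = offload all layers
    n_batch=256,
    n_ctx=2048,
    f16_kv=True,
    verbose=True,      # the verbose log reports how many layers were actually offloaded
)

If loading fails or VRAM fills up, lower n_gpu_layers again.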

Check your llama.cpp logs while the model is loading.

If they look like this:

main: build = 722 (049aa16)
main: seed  = 1
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
llama_model_load_internal: offloading 0 repeating layers to GPU
llama_model_load_internal: **offloaded 0/35 layers to GPU**
llama_model_load_internal: total VRAM used: 512 MB
...................................................................................................
llama_init_from_file: kv self size  =  256.00 MB
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory

system_info: n_threads = 12 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 256, n_keep = 0

and if it says offloading 0 repeating layers to GPU (or offloads fewer layers than the total), try increasing n_gpu_layers.

Upvotes: 1
