Reputation: 43
Confession: First of all, I am not an expert in this area at all; I am just practicing and trying to learn as I work. I am also unsure whether this kind of model simply cannot run on this type of GPU.
I am trying to run a model locally on my laptop (for now, this is the only machine I have). I downloaded the model from TheBloke on Hugging Face.
Intention: I am using LangChain; I will upload some data and have a conversation with the model about it (roughly, that is the idea; unfortunately, I cannot say more for privacy reasons).
What Has Worked So Far: I first used the llama-cpp-python (CPU) library and attempted to run the model, and it worked. But as expected, inference was so slow that it took nearly 2 minutes to answer a single question.
Then I tried to build with cuBLAS using the command below:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade llama-cpp-python
It worked, and after running the program I noticed BLAS = 1 (previously, in the CPU version, it was BLAS = 0).
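One thing I am not sure about: whether pip might have reused a previously built CPU-only wheel from its cache. If that were the case, forcing a clean rebuild would presumably look like the command below (the extra flags are standard pip options, not something specific to llama-cpp-python):
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python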
Problem: After running the entire program, I noticed that while I was uploading the data I wanted to converse about, the model was not being loaded onto my GPU. I realized this after looking at Nvidia X Server, which showed that my GPU memory was not being consumed at all, even though the terminal showed BLAS = 1. So apparently BLAS = 1 does not mean the model is loaded onto the GPU. Now I am not sure what to do at this point; I searched the internet but did not find a proper fix.
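To narrow this down, a minimal check outside LangChain would look something like this (my own rough sketch; n_gpu_layers = 8 is just an arbitrary test value), so I can watch the loading log for any mention of layers being offloaded:
from llama_cpp import Llama

# Diagnostic only: load the model directly with llama-cpp-python, request a few
# GPU layers, and read the startup log. If the cuBLAS build is really in use, it
# should report the CUDA device and an "offloaded X/Y layers to GPU" line.
test_llm = Llama(
    model_path="/my/model/path/directory/sub_directory/mistral_7b_v_1/mistral-7b-v0.1.Q2_K.gguf",
    n_gpu_layers=8,  # arbitrary test value, just to request some offloading
    n_ctx=256,
    verbose=True,
)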
Some Additional Problems:
I tried setting n_batch = 256 instead of the default value of 512 to reduce the strain on my GPU, but I got the error ValueError: Requested tokens exceeded context window... So I am wondering how to handle the tradeoff between the number of GPU layers, the context window, and the batch size. The LlamaCpp GPU documentation describes these parameters, but I could not work out the tradeoff from it.
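My rough understanding so far (which may well be wrong) is that the ValueError is about the prompt plus the requested completion not fitting into n_ctx, and that n_batch only changes how many prompt tokens are processed per step, not the window itself. A toy sketch of what I think the check is (all numbers below are made up by me):
# My (possibly wrong) mental model of the limit behind the ValueError:
# everything must fit inside n_ctx, no matter what n_batch is set to.
n_ctx = 256              # my current context window
completion_budget = 256  # made-up number of tokens asked from the model
prompt_tokens = 900      # made-up size of 5 retrieved chunks + question + chat history

if prompt_tokens + completion_budget > n_ctx:
    print("Requested tokens exceed the context window")  # the situation I seem to be in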
The Code Snippets of My Project Where I Actually Set Up the Model:
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

# vectorstore is built elsewhere in the project from the uploaded data.
language_model = LlamaCpp(
    model_path="/my/model/path/directory/sub_directory/mistral_7b_v_1/mistral-7b-v0.1.Q2_K.gguf",
    n_gpu_layers=1,   # only 1 layer requested for GPU offload
    n_batch=64,       # tokens processed in parallel per batch
    n_ctx=256,        # total context window (prompt + generated tokens)
    f16_kv=True,      # keep the KV cache in half precision
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    verbose=True,
)
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
)
return ConversationalRetrievalChain.from_llm(
    llm=language_model,
    retriever=vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 5}),
    memory=memory,
)
Hardware Details:
Lastly: Thank you for reading this long post. I look forward to any answers. :)
Upvotes: 3
Views: 2389
Reputation: 11
Change the n_gpu_layers parameter: increase it slowly until your GPU runs out of memory. Setting n_gpu_layers to -1 offloads all layers to the GPU.
Check your llama-cpp logs while loading the model. If they look like this:
main: build = 722 (049aa16)
main: seed = 1
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090
llama.cpp: loading model from models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 5407.71 MB (+ 1026.00 MB per state)
llama_model_load_internal: offloading 0 repeating layers to GPU
llama_model_load_internal: **offloaded 0/35 layers to GPU**
llama_model_load_internal: total VRAM used: 512 MB
...................................................................................................
llama_init_from_file: kv self size = 256.00 MB
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory
system_info: n_threads = 12 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 256, n_keep = 0
and if it says offloading 0 repeating layers to GPU, or any number lower than you expect, try increasing n_gpu_layers.
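A minimal sketch, reusing the LlamaCpp setup from your question (same path and settings, only n_gpu_layers changed); keep verbose=True so the loading log prints the offloaded X/Y layers to GPU line:
from langchain.llms import LlamaCpp

language_model = LlamaCpp(
    model_path="/my/model/path/directory/sub_directory/mistral_7b_v_1/mistral-7b-v0.1.Q2_K.gguf",
    n_gpu_layers=-1,  # -1 = try to offload every layer; lower it if you hit CUDA out-of-memory errors
    n_batch=64,
    n_ctx=256,
    f16_kv=True,
    verbose=True,     # prints the loading log, including the offload line
)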
Upvotes: 1