Reputation: 77
I am trying to learn about using LLMs.
I am using Ollama to pull models and the LangChain framework for implementation. The implementation code runs in a locally hosted Jupyter notebook. Everything runs on WSL2 on a Windows laptop with an Intel Core i5 and 16 GB of RAM. The WSL config is the default, so 50% of the host RAM.
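(I know the 50% default can be raised with a .wslconfig file in the Windows user profile, something like:

[wsl2]
memory=12GB

followed by wsl --shutdown, but I have left it at the default.)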
The gemma:2b model takes around 30 seconds to reply to the prompt in the LangChain quickstart tutorial https://python.langchain.com/docs/get_started/quickstart (my code is identical to the tutorial up to "Diving Deeper", choosing the option Local (using Ollama)):
chain.invoke({"input": "how can langsmith help with testing?"})
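For context, the chain is just the quickstart's local Ollama variant, roughly the following (exact imports may vary with the LangChain version):

from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Local model served by Ollama; swap in "mistral" for the 7.3B test
llm = Ollama(model="gemma:2b")

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a world class technical documentation writer."),
    ("user", "{input}"),
])

# Prompt -> model -> plain-string output
chain = prompt | llm | StrOutputParser()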
If I use a larger model like mistral (7.3B), the response time is around 1 minute 15 seconds.
I understand that more RAM and a GPU would be preferable, but why does it appear that only 1.2 GB of RAM is being used?
➜ ~ free -mh -s 10
               total        used        free      shared  buff/cache   available
Mem:           7.6Gi       1.2Gi       1.7Gi       2.3Mi       4.9Gi       6.4Gi
Swap:          2.0Gi          0B       2.0Gi
Is there a way to configure Ollama to use more RAM?
Observed: free -mh shows that only 1.2 GB of RAM is being used, with 6.4 GB still available, during Ollama compute.
Expected: Ollama uses all available RAM (more like 7-8 GB) during compute, and the response time is quicker.
I can't find anything on the internet, not even people asking the same question (which normally means that I've completely misunderstood what my issue is...).
The RAM is available to WSL, as other resource-heavy development projects use all available RAM (between 7 and 8 GB) when hosting GitLab, a GitLab runner, Nexus, and other Dockerised containers at the same time.
Here are the Ollama logs:
➜ ~ ollama serve
time=2024-02-27T13:53:29.377+01:00 level=INFO source=images.go:710 msg="total blobs: 5"
time=2024-02-27T13:53:29.378+01:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-02-27T13:53:29.380+01:00 level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.1.27)"
time=2024-02-27T13:53:29.382+01:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-02-27T13:53:34.146+01:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [rocm_v6 cpu cpu_avx2 cpu_avx cuda_v11 rocm_v5]"
time=2024-02-27T13:53:34.146+01:00 level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-02-27T13:53:34.146+01:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-02-27T13:53:38.249+01:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: []"
time=2024-02-27T13:53:38.249+01:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library librocm_smi64.so"
time=2024-02-27T13:53:38.249+01:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: []"
time=2024-02-27T13:53:38.249+01:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-27T13:53:38.249+01:00 level=INFO source=routes.go:1042 msg="no GPU detected"
[GIN] 2024/02/27 - 13:55:32 | 200 | 37.084µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/02/27 - 13:55:32 | 200 | 910.269µs | 127.0.0.1 | POST "/api/show"
[GIN] 2024/02/27 - 13:55:32 | 200 | 945.017µs | 127.0.0.1 | POST "/api/show"
time=2024-02-27T13:55:32.502+01:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-27T13:55:32.502+01:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-27T13:55:32.502+01:00 level=INFO source=llm.go:77 msg="GPU not available, falling back to CPU"
loading library /tmp/ollama930891318/cpu_avx2/libext_server.so
time=2024-02-27T13:55:32.503+01:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama930891318/cpu_avx2/libext_server.so"
time=2024-02-27T13:55:32.503+01:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /home/unix/.ollama/models/blobs/sha256:e8a35b5937a5e6d5c35d1f2a15f161e07eefe5e5bb0a3cdd42998ee79b057730 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,58980] = ["▁ t", "i n", "e r", "▁ a", "h e...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 3.83 GiB (4.54 BPW)
llm_load_print_meta: general.name = mistralai
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.11 MiB
llm_load_tensors: CPU buffer size = 3917.87 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU input buffer size = 13.02 MiB
llama_new_context_with_model: CPU compute buffer size = 160.00 MiB
llama_new_context_with_model: graph splits (measure): 1
time=2024-02-27T13:55:46.285+01:00 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop"
[GIN] 2024/02/27 - 13:55:46 | 200 | 13.934248657s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/02/27 - 14:00:03 | 200 | 1m14s | 127.0.0.1 | POST "/api/generate"
Upvotes: 6
Views: 10350
Reputation: 83
Model data is memory-mapped and shows up in the file cache (the buff/cache column in free). Note too the VIRT, RES & SHR memory figures of the Ollama processes (e.g. in top).
This GitHub issue describes your problem very well: https://github.com/ollama/ollama/issues/2496
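If you want to see this from the process side, here is a minimal sketch using psutil (assuming the package is installed and the server process name contains "ollama"; adjust if yours differs):

import psutil

# The mmapped model weights are file-backed, so they show up in the page
# cache (buff/cache in free) and in the process's SHR/RES figures rather
# than in free's "used" column.
for proc in psutil.process_iter(["name", "memory_info"]):
    if proc.info["name"] and "ollama" in proc.info["name"].lower():
        mem = proc.info["memory_info"]
        print(f"PID {proc.pid}: "
              f"VIRT={mem.vms / 2**20:.0f} MiB, "
              f"RES={mem.rss / 2**20:.0f} MiB, "
              f"SHR={mem.shared / 2**20:.0f} MiB")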
Upvotes: 1