reach

Reputation: 21

I have a problem using n_gpu_layers in the llama_cpp Llama function

I am attempting to load the Zephyr model with llama_cpp's Llama class, and while everything works, generation is slow. The GPU appears to be underutilized, especially compared to LM Studio, where the same number of GPU layers results in much faster output and noticeable spikes in GPU usage.

Essentially, I'm aiming for performance in the terminal that matches the speed of LM Studio, but I'm unsure how to achieve this. There are no apparent bugs, and my Llama configuration is as follows:

from llama_cpp import Llama

llm = Llama(
    model_path="zephyr-7b-beta.Q4_K_M.gguf",
    n_gpu_layers=32,
    n_threads=6,
    n_ctx=2048,
    n_batch=512,
    seed=0,
    use_mmap=True,
    use_mlock=False,
    mul_mat_q=True,
    low_vram=False,
    rope_freq_base=10000.0,
    rope_freq_scale=1.0,
    tensor_split=None,
    verbose=False,
)

I am also passing chat history into it, but I still cannot see any GPU usage.
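The history call looks roughly like this (a sketch using the standard create_chat_completion API; the actual messages come from my application):

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"},
]
out = llm.create_chat_completion(messages=history, max_tokens=256)
print(out["choices"][0]["message"]["content"])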

Upvotes: 1

Views: 1977

Answers (0)
