neelamari

Reputation: 59

llama-cpp-python model not using NVIDIA GPU

I am trying to run the model below, but it is not using the GPU and is defaulting to CPU compute.

The code runs from a Docker image on a RHEL node that has an NVIDIA GPU (verified; the GPU works with other models).

Docker command:

    docker run -it --rm -p 8888:8888 --runtime=nvidia --gpus all -v /users/jupyter/data:/data -v /users/jupyter/notebooks:/project/notebooks llama-gpu
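
For completeness, a quick way to confirm the GPU is visible inside this container (not just on the node) is to shell out to nvidia-smi from the notebook:

    import subprocess

    # nvidia-smi only works inside the container when --runtime=nvidia /
    # --gpus all actually took effect, so this doubles as a runtime check.
    print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)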

Model: llama-2-7b-chat.Q3_K_L.gguf

Example

    !export FORCE_CMAKE=1
    !export CMAKE_ARGS="-DLLAMA_CUBLAS=on"
    !export LLAMA_CPP_LIB=/azureml-envs/tensorflow-2.12-cuda11/lib/python3.8/site-packages/llama_cpp_cuda/libllama.so

    !pip install llama-cpp-python

    from llama_cpp import Llama

    def question_generator(context):
        # Llama-2 chat format; [/INST] closes the instruction block.
        prompt = """[INST] <<SYS>>
        You are a helpful, respectful and honest assistant.
        Always respond as helpfully as possible, while being safe.
        Please ensure you generate the question based on the given context only
        <</SYS>>
        generate 3 questions based on the given content:-{}.
        [/INST]""".format(context)

        # n_gpu_layers=248 is simply "more layers than the model has",
        # so everything should be offloaded to the GPU.
        llm = Llama(
            model_path="llama-2-7b-chat.Q3_K_L.gguf",
            n_ctx=8192,
            n_batch=512,
            use_mlock=True,
            n_gpu_layers=248,
            n_threads=8
        )

        output = llm(prompt,
                     max_tokens=-1,
                     echo=False,
                     temperature=0.2,
                     top_p=0.1)

        return output['choices'][0]['text']

    df["questions"] = ""
    for i in range(len(df)):
        # df holds the source texts; .loc avoids chained-assignment issues.
        df.loc[df.index[i], "questions"] = question_generator(df["text"].iloc[i])

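Is there a way to confirm what pip actually built? From what I can tell, recent llama-cpp-python versions expose a capability flag in the low-level bindings (assuming llama_supports_gpu_offload is present in the installed version), and the loader's verbose output should mention offloaded layers when CUDA is in use:

    from llama_cpp import llama_cpp

    # False would mean the wheel was compiled without cuBLAS, i.e. a CPU-only
    # build in which n_gpu_layers is silently ignored.
    print(llama_cpp.llama_supports_gpu_offload())

    # With verbose=True (the default), a CUDA build also logs lines like
    # "llm_load_tensors: offloaded 35/35 layers to GPU" while the model loads.
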
I tried the change below, based on other suggestions. It still doesn't use GPU compute:

    CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
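
Could the problem be that !export runs in a throwaway subshell, so the variables never reach the pip process? Setting them on the kernel itself before reinstalling should rule that out; something like:

    import os

    # Put the build flags in the kernel's own environment; %pip runs pip with
    # this same environment, unlike the !export lines above.
    os.environ["CMAKE_ARGS"] = "-DLLAMA_CUBLAS=on"
    os.environ["FORCE_CMAKE"] = "1"

    %pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir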

Upvotes: 4

Views: 3021

Answers (0)
