Avish Wagde

Reputation: 143

Very slow response from LLM-based Q/A query engine

I built a Q/A query bot over a 4 MB CSV file I have locally. I'm using Chroma for the vector DB, the Instructor Large model from Hugging Face for embeddings, and LlamaCPP with llama2-13b-chat as the chat LLM. The vector database it created was around 44 MB (stored locally). After creating the vector DB, I used it to build the Q/A query bot, but the responses are far too slow: each one takes around 30-40 minutes to generate. In addition, from the 2nd question onwards it prints the warning Llama.generate: prefix-match hit. I don't understand why it is so slow...

My PC specifications: Processor: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz, RAM: 16GB, System type: 64-bit OS, x64-based processor

from llama_index import load_index_from_storage
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.index_store import SimpleIndexStore
from llama_index import LangchainEmbedding, ServiceContext, StorageContext, download_loader, LLMPredictor
from langchain.embeddings import HuggingFaceEmbeddings

from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.response_synthesizers import get_response_synthesizer

import chromadb
from chromadb.config import Settings

## create ChromaClient again
chroma_client = chromadb.PersistentClient(path="./storage/vector_storage/chromadb/")

# load the collection
collection = chroma_client.get_collection("csv_ecgi_db")

## construct storage context
load_storage_context = StorageContext.from_defaults(
    vector_store = ChromaVectorStore(chroma_collection=collection),
    index_store = SimpleIndexStore.from_persist_dir(persist_dir="./storage/index_storage/ecgi/"),
)

embedding_model_id = 'hkunlp/instructor-large'

embed_model = LangchainEmbedding(HuggingFaceEmbeddings(model_name=embedding_model_id))

## construct service context
load_service_context = ServiceContext.from_defaults(embed_model=embed_model)

## finally to load the index
load_index = load_index_from_storage(service_context=load_service_context, 
                                     storage_context=load_storage_context)

# configure response synthesizer
response_synthesizer = get_response_synthesizer(
    response_mode='compact',
    service_context = load_service_context)

# build a retriever over the loaded index (similarity_top_k=2 is the library default)
retriever = VectorIndexRetriever(index=load_index, similarity_top_k=2)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever = retriever,
    response_synthesizer = response_synthesizer,
)

# query
response = query_engine.query("what were the danish Horror movies in february of 2023?")
print(response)
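
The LlamaCPP model itself is not created in the snippet above; with this llama_index API it is typically passed into the ServiceContext alongside the embedding model, roughly like the sketch below. The model path and generation parameters here are only illustrative.

from llama_index.llms import LlamaCPP

# illustrative: point model_path at the local llama-2-13b-chat model file
llm = LlamaCPP(
    model_path="./models/llama-2-13b-chat.Q4_0.gguf",  # illustrative path
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    model_kwargs={"n_gpu_layers": 0},  # 0 = CPU only
    verbose=True,
)

load_service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)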

I looked around on GitHub and found people discussing the same issue, but no conclusion was reached; their response times were similar to mine. I was expecting it to respond within seconds, like ChatGPT does.

Upvotes: 5

Views: 12246

Answers (2)

JeroenAdam

Reputation: 31

Look no further: it's your CPU, which can only deliver full performance for maybe 10 seconds before power throttling kicks in and limits the CPU package power to only 10 W. You could look into an external GPU. I'm running Mistral 7B 4-bit (K_S) on such an 11th-gen laptop with 75% of the layers offloaded to a 4 GB GDDR6 GPU and get 8.5 tokens per second; on CPU only, I get 2.5 t/s. Check your Task Manager and you'll see the clock speed drop as soon as inference starts.
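
As a rough illustration of that layer offloading, here is a minimal llama-cpp-python sketch, assuming a GPU-enabled build of llama.cpp (e.g. cuBLAS); the model path and layer count are illustrative, not the exact setup described above.

from llama_cpp import Llama

# assumes llama-cpp-python was installed/built with GPU support;
# the model path is an illustrative local file
llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_S.gguf",
    n_gpu_layers=24,  # number of layers to offload to the GPU; higher offloads more
    n_ctx=2048,
)

out = llm("Q: Why is CPU-only inference slow? A:", max_tokens=64)
print(out["choices"][0]["text"])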

Upvotes: 1

Konrad Höffner

Reputation: 12207

Inference can be very slow on a CPU. The biggest performance boost comes from making sure that llama.cpp uses your GPU: install the correct drivers and libraries, such as CUDA in a supported version, and then compile llama.cpp with the appropriate compiler flags; see https://github.com/ggerganov/llama.cpp.

If you do not have a supported GPU installed, then a Q/A query bot that answers within seconds is not realistic; ChatGPT runs on many GPUs in parallel. Still, 30-40 minutes seems excessive: you can reduce the time somewhat by selecting a smaller model such as LLaMA 2 7B Chat instead of 13B Chat and by using a quantized model. However, this will also reduce the quality of the output.
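
For instance, with the llama_index LlamaCPP wrapper, switching to a quantized 7B model would mean pointing model_path at a smaller file, roughly as in this sketch (the file name and parameter values are illustrative, not a specific recommendation):

from llama_index.llms import LlamaCPP

llm = LlamaCPP(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # illustrative: 4-bit quantized 7B chat model
    max_new_tokens=256,
    context_window=3900,
    model_kwargs={"n_gpu_layers": 32},  # only takes effect with a GPU-enabled llama.cpp build
)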

Your Intel Core i7-1165G7 also seems to be on the weak side with only 4 CPU cores. Because inference can run in parallel across CPU cores, increasing the number of cores has a strong impact on performance. However, using a GPU will increase performance much more than even the fastest CPU, such as an Intel i9-13900K.

Upvotes: 2
