Reputation: 13
I'm working on a project that uses llama_index to retrieve document information in Jupyter Notebook, but I'm experiencing very slow query response times (around 15 minutes per query). I'm using the following code:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
documents = SimpleDirectoryReader("C:path/example/data").load_data()
# Using bge-base embedding model
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
# Setting up Ollama LLM with a timeout of 1 hour
Settings.llm = Ollama(model="llama3", request_timeout=3600.0)
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)
I'm running this in a local Jupyter notebook, and it consistently takes 15 minutes or longer to return a result.
I tried reducing request_timeout to speed up the query, but that only produces a ReadTimeout error:
The code is identical to the above except for the timeout:

# Setting up Ollama LLM with a 60-second timeout
Settings.llm = Ollama(model="llama3", request_timeout=60.0)
How can I speed up the response time when querying documents? Are there ways to optimize the setup, or to run the models fully locally to improve retrieval speed? Specifically, is there a way to handle both the embeddings and the LLM processing locally to avoid network latency or timeouts? Any help on reducing retrieval time or configuring a local model setup would be appreciated.
Upvotes: 0
Views: 127
Reputation: 375
As @AKX commented, you'll first need to measure and tell us how long the different parts of your code take, for example the document-indexing line and the querying line.
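A minimal timing sketch, reusing the imports, documents, and Settings from your question, so you can see whether the time goes into indexing (computing embeddings) or into the query (retrieval plus LLM generation):

import time

t0 = time.perf_counter()
index = VectorStoreIndex.from_documents(documents)  # embeddings are computed here
print(f"Indexing took {time.perf_counter() - t0:.1f}s")

query_engine = index.as_query_engine()

t0 = time.perf_counter()
response = query_engine.query("What did the author do growing up?")
print(f"Query (retrieval + LLM generation) took {time.perf_counter() - t0:.1f}s")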
You can try a lighter LLM like Llama 2 7B Q4_K_M.
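If you want to stay with Ollama, one option is to point Settings.llm at a smaller quantized model. A sketch, assuming the llama2:7b-chat-q4_K_M tag is available in your local Ollama library (check with ollama list, or pull it first):

from llama_index.core import Settings
from llama_index.llms.ollama import Ollama

# Pull the model first, e.g.: ollama pull llama2:7b-chat-q4_K_M
Settings.llm = Ollama(model="llama2:7b-chat-q4_K_M", request_timeout=600.0)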
If you're using LlamaCPP, pass model_kwargs={"n_gpu_layers": -1} to offload all layers to the GPU for faster inference. For example:
from llama_index.llms.llama_cpp import LlamaCPP

llm_url = 'https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf'
llm = LlamaCPP(model_url=llm_url, temperature=0.7, max_new_tokens=256, context_window=4096, generate_kwargs={"stop": ["</s>", "[INST]", "[/INST]"]}, model_kwargs={"n_gpu_layers": -1}, verbose=True)
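Wiring that into your existing pipeline would look roughly like this (a sketch; it assumes the llama-index-llms-llama-cpp integration is installed and llama-cpp-python was built with GPU support):

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
Settings.llm = llm  # the LlamaCPP instance from above

documents = SimpleDirectoryReader("C:path/example/data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
print(query_engine.query("What did the author do growing up?"))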
You can refer to my full working script from a while ago, which takes ~3 seconds to index the documents and ~5 seconds to produce the first generated token: https://colab.research.google.com/github/kazcfz/LlamaIndex-RAG/blob/main/LlamaIndex_RAG.ipynb
Upvotes: 0