I'm building a Retrieval Augmented Generation (RAG) pipeline using LangChain, and I'm encountering an issue where my vectorstore seems to be recomputed every time I pass a new question to the pipeline. This leads to significant performance degradation, especially when dealing with large datasets.
Here's my setup:
I'm using Chroma as my vectorstore, EnsembleRetriever as my retriever, and LangChain's UnstructuredPDFLoader() to extract the text along with the metadata of the documents.
I'm loading my documents from a file/database and splitting them into chunks using a text splitter. I'm embedding these chunks using OllamaEmbeddings.
I'm using the retrieval chain function to combine the retriever and a question-answering chain.
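In short, the retrieval side is wired roughly like this (a condensed sketch of the setup described above; the full script is at the end of the post):

from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

docs = UnstructuredPDFLoader("some.pdf").load()          # text + metadata per document
chunks = CharacterTextSplitter(chunk_size=1024, chunk_overlap=200).split_documents(docs)
embeddings = OllamaEmbeddings(base_url="http://localhost:11434", model="nomic-embed-text")
vectordb = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_store/")
# dense retriever + BM25 keyword retriever combined into one ensemble retriever,
# which then feeds the question-answering chain
ensemble_retriever = EnsembleRetriever(
    retrievers=[vectordb.as_retriever(), BM25Retriever.from_documents(chunks)],
    weights=[0.3, 0.7],
)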
The problem:
When I run the pipeline with different questions, it seems like the vectorstore is being created anew each time instead of reusing the existing embeddings, which causes a significant performance bottleneck. The Streamlit output below shows my timings for the slowest parts of the pipeline; it clearly shows that the chunking and the vectorstore computation are redone from scratch for every question.
streamlit run IMS_RAG_m2.py
You can now view your Streamlit app in your browser.
Local URL: http://localhost:8504
Network URL: http://10.54.27.49:8504
---------->Entered LOAD-CHUNK-EMBED<--------------
Collection already exists
Extracting text took this much time --> 306.8951711868867
Created a chunk of size 2274, which is longer than the specified 1024
Created a chunk of size 1538, which is longer than the specified 1024
Created a chunk of size 1742, which is longer than the specified 1024
Created a chunk of size 1040, which is longer than the specified 1024
Created a chunk of size 1730, which is longer than the specified 1024
Created a chunk of size 1052, which is longer than the specified 1024
Created a chunk of size 1128, which is longer than the specified 1024
Created a chunk of size 1561, which is longer than the specified 1024
Created a chunk of size 1368, which is longer than the specified 1024
Created a chunk of size 3190, which is longer than the specified 1024
Created a chunk of size 1467, which is longer than the specified 1024
no of chunks 637
Chunking took this much time --> 0.02095871907658875
Collection already exists
Embedding took this much time --> 16.116066633025184
Retrieval took this much time --> 0.04899920406751335
/home/sarmad/anaconda3/envs/RAG/lib/python3.10/site-packages/langchain_core/_api/deprecation.py:141: LangChainDeprecationWarning: Since Chroma 0.4.x the manual persistence method is no longer supported as docs are automatically persisted.
warn_deprecated(
The timetaken for a global load and chunk and embed 323.08347079413943
Query Processing.......
---------->Entered LOAD-CHUNK-EMBED<--------------
Collection already exists
Extracting text took this much time --> 299.51400842610747
Created a chunk of size 2274, which is longer than the specified 1024
Created a chunk of size 1538, which is longer than the specified 1024
Created a chunk of size 1742, which is longer than the specified 1024
Created a chunk of size 1040, which is longer than the specified 1024
Created a chunk of size 1730, which is longer than the specified 1024
Created a chunk of size 1052, which is longer than the specified 1024
Created a chunk of size 1128, which is longer than the specified 1024
Created a chunk of size 1561, which is longer than the specified 1024
Created a chunk of size 1368, which is longer than the specified 1024
Created a chunk of size 3190, which is longer than the specified 1024
Created a chunk of size 1467, which is longer than the specified 1024
no of chunks 637
Chunking took this much time --> 0.02046454302035272
Collection already exists
Embedding took this much time --> 15.863782780012116
Retrieval took this much time --> 0.06083990586921573
The timetaken for a global load and chunk and embed 315.4610354742035
Query Processing.......
My questions:
Is this expected behavior for LangChain's RAG pipeline? Should the vectorstore be recomputed for every question?
If not, how can I ensure that the vectorstore is only computed once and reused across multiple queries?
Are there any best practices or configurations for LangChain that can help optimize the vectorstore creation and usage in a RAG pipeline?
I tried defining retrieved_chunks and the vectorstore outside of the function (i.e., making them global) so that they are accessible to all of the other functions in the pipeline, but the chunks and the vectorstore are still recomputed from scratch, making my RAG app slower than it should be.
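Roughly, that attempt looked like the sketch below (illustrative only, not my exact code; it reuses load_chunk_persist_pdf() from the full script further down):

# build the retriever once at module level and reuse it inside get_llm_response()
ensemble_retriever = load_chunk_persist_pdf()   # hoisted out of the function

def get_llm_response(query):
    # reuse the module-level retriever instead of rebuilding it inside the function
    retrieved_docs = ensemble_retriever.invoke(query)
    print("retrieved-documents", retrieved_docs)
    ...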
Here's my code:
import os
import streamlit as st
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains.question_answering import load_qa_chain
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings
import chromadb
from langchain_community.llms import Ollama
from langchain_community.document_loaders import UnstructuredPDFLoader
from htmlTemplates import css, bot_template, user_template
from langchain.load import dumps, loads
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain.prompts import ChatPromptTemplate
import timeit
def load_chunk_persist_pdf() -> EnsembleRetriever:
    # load every PDF in the folder
    pdf_folder_path = "./IMS/"
    documents = []
    for file in os.listdir(pdf_folder_path):
        if file.endswith('.pdf'):
            pdf_path = os.path.join(pdf_folder_path, file)
            loader = UnstructuredPDFLoader(pdf_path)
            documents.extend(loader.load())
    # split into chunks and time the splitting step
    text_splitter = CharacterTextSplitter(chunk_size=1024, chunk_overlap=200)
    start0 = timeit.default_timer()
    chunked_documents = text_splitter.split_documents(documents)
    print("no of chunks", len(chunked_documents))
    print("Chunking took this much time -->", timeit.default_timer() - start0)
    client = chromadb.Client()
    if client.list_collections():
        consent_collection = client.create_collection("consent_collection")
        print("----------Fresh Collection created-----------")
    else:
        print("----------Collection already exists-----------")
    print(chunked_documents[2])
    # embed the chunks into the vectorstore and time the embedding step
    embedding_function = OllamaEmbeddings(base_url="http://localhost:11434", model="nomic-embed-text")
    start11 = timeit.default_timer()
    vectordb = Chroma.from_documents(chunked_documents, embedding_function,
                                     persist_directory="./testing_space_IMS_TESTING/chroma_store/")
    print("Embedding took this much time -->", timeit.default_timer() - start11)
    # combine the dense retriever with a BM25 keyword retriever
    retriever = vectordb.as_retriever()
    keyword_retriever = BM25Retriever.from_documents(chunked_documents)
    keyword_retriever.k = 3
    ensemble_retriever = EnsembleRetriever(retrievers=[retriever, keyword_retriever],
                                           weights=[0.3, 0.7])
    vectordb.persist()
    return ensemble_retriever
def get_llm_response(query):
    start = timeit.default_timer()
    en_retriever = load_chunk_persist_pdf()
    print("The time taken for testing version to load and embed:",
          timeit.default_timer() - start)
    llm = Ollama(model="llama3.1:70b")
    template1 = """You are an assistant for question-answering tasks.
    Use the following pieces of retrieved context to answer the question.
    If you don't know the answer, just say that you don't know.
    Question: {question}
    Context: {context}
    Answer:
    """
    Retrieved_docs = en_retriever.invoke(query)
    print("retrieved-documents", Retrieved_docs)
    prompt1 = ChatPromptTemplate.from_template(template1)
    rag_chain = (
        {"context": en_retriever, "question": RunnablePassthrough()}
        | prompt1
        | llm
        | StrOutputParser()
    )
    start1 = timeit.default_timer()
    answer1 = rag_chain.invoke(query)
    print("The time taken for query processing is :",
          timeit.default_timer() - start1)
    return answer1
def save_uploaded_file(uploadedfile):
    with open(os.path.join(doc_path, uploadedfile.name), "wb") as f:
        f.write(uploadedfile.getbuffer())
    return st.success("File Saved: {}".format(uploadedfile.name))
doc_path = "./IMS/"
st.set_page_config(page_title="Chat with IMS Documents - TESTING", page_icon=":books:")
st.header("Chat with Custom LLM about IMS Documents - TESTING :books:")

# Set up the sidebar
st.sidebar.title("Document's List:")
with st.sidebar:
    st.subheader("Your documents")
    doc = st.file_uploader("Upload your PDF documents here...", accept_multiple_files=True)
    if doc is not None:
        for file in doc:
            save_uploaded_file(file)
            print("New_file", file)

# Display the names in the sidebar
file_names = [f for f in os.listdir(doc_path) if os.path.isfile(os.path.join(doc_path, f))]
for name in file_names:
    st.sidebar.write(name)

form_input = st.text_input('Ask questions from your documents..')
#print("Query Processing.......")
if len(form_input) != 0:
    st.write(bot_template.replace("{{MSG}}", get_llm_response(form_input)), unsafe_allow_html=True)