Reputation: 345
I'm creating an application with LangChain, ChromaDB, and Ollama (with the Mistral model), where I have dozens of PDF files, each with many pages. The problem is that ingestion takes a long time (34 minutes to load 30 PDF files into the vector database), and the Streamlit application also blocks for that entire time while loading.
Is there any way to parallelize the database ingestion to make the whole process faster (given that the GPU is a real limitation)? How can I separate the Streamlit app from the vector database work? What's the best way to do this?
This is how I'm loading the documents:
# document_loader.py
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
...
def load_local_documents(self):
    # Recursively load every PDF under test_files/
    loader = DirectoryLoader("test_files/", glob="**/*.pdf")
    self._documents = loader.load()

def get_text_splitted(self):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=50
    )
    return text_splitter.split_documents(self._documents)
# vector_database.py
from langchain.vectorstores import Chroma
...
def create_vector_db(self, embeddings):
    # Open (or create) the persisted Chroma store
    self._vector_db = Chroma(persist_directory="chroma_db",
                             embedding_function=embeddings)
    self._vector_db.persist()

def create_vector_db_from_documents(self, texts, embeddings):
    # Embed all chunks and write them into the persisted store
    self._vector_db = Chroma.from_documents(
        documents=texts,
        embedding=embeddings,
        persist_directory="chroma_db"
    )
    self._vector_db.persist()
Upvotes: 1
Views: 351
Reputation: 31
According to this LangChain GitHub issue, multithreading is not available for `TextSplitter`.
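The splitting step is rarely the bottleneck, though; per-file PDF parsing usually is, and that part can be parallelized. Here is a minimal, standard-library-only sketch of the pattern, where `load_one_pdf` is a hypothetical stand-in for "parse one PDF" (in the real app it would wrap a LangChain PDF loader for that path):

```python
# parallel_ingest.py - sketch: parallelize the per-file loading step.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def load_one_pdf(path):
    # Hypothetical placeholder: the real version would return the
    # documents produced by a PDF loader for this one file.
    return f"parsed {path.name}"

def load_all_pdfs(pdf_dir, max_workers=4):
    """Parse every PDF under pdf_dir concurrently and collect the results."""
    paths = sorted(Path(pdf_dir).glob("**/*.pdf"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_one_pdf, paths))
```

Note that `DirectoryLoader` also accepts a `use_multithreading=True` argument, which does essentially this internally. As for decoupling Streamlit from the indexing cost: run the ingestion once as a separate script (outside the app), and have the Streamlit app only reopen the already-persisted store with `Chroma(persist_directory="chroma_db", embedding_function=embeddings)`, ideally wrapped in `st.cache_resource` so it is created once rather than on every rerun.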
Upvotes: 0