Diallo Francis Patrick

BM25Retriever + ChromaDB Hybrid Search Optimization using LangChain

For those who have integrated the ChromaDB client with the Langchain framework, I am proposing the following approach to implement the Hybrid search (Vector Search + BM25Retriever):

from langchain_chroma import Chroma
import chromadb
from chromadb.config import Settings
from langchain_openai import OpenAIEmbeddings
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_core.documents import Document
from langgraph.graph import START, StateGraph
from typing import List
from typing_extensions import TypedDict
 
 
# Assuming you have instantiated a Chroma client and integrated it into LangChain (example below)
"""
persistent_client = chromadb.PersistentClient(path="./test", settings=Settings(allow_reset=True))
collection = persistent_client.get_or_create_collection(
    name="example",
    metadata={
        "hnsw:space": "cosine",
        # you can add other HNSW parameters if you want
    },
)

chroma = Chroma(
    client=persistent_client,
    collection_name=collection.name,
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-large"),
)
"""
 
def hybrid_search(query: str, k: int = 5):
    """Perform a hybrid search (similarity_search + BM25Retriever) over the collection."""
    # Fetch all raw documents from ChromaDB
    raw_docs = chroma.get(include=["documents", "metadatas"])
    # Convert them into Document objects
    documents = [
        Document(page_content=doc, metadata=meta)
        for doc, meta in zip(raw_docs["documents"], raw_docs["metadatas"])
    ]
    # Build a BM25Retriever from the documents
    bm25_retriever = BM25Retriever.from_documents(documents=documents, k=k)
    # Build a vector-search retriever from the Chroma instance
    similarity_search_retriever = chroma.as_retriever(
        search_type="similarity",
        search_kwargs={"k": k},
    )
    # Combine the retrievers with LangChain's EnsembleRetriever
    ensemble_retriever = EnsembleRetriever(
        retrievers=[similarity_search_retriever, bm25_retriever],
        weights=[0.5, 0.5],
    )
    # Retrieve the k relevant documents for the query
    return ensemble_retriever.invoke(query)  # use ainvoke(query) to retrieve asynchronously
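For context, EnsembleRetriever merges the two ranked result lists with weighted Reciprocal Rank Fusion. A minimal pure-Python sketch of that fusion step (illustrative only; the function name is hypothetical, and the c=60 constant follows the common RRF convention rather than LangChain's exact code):

```python
def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several best-first ranked lists of doc ids with weighted RRF."""
    scores = {}
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            # Each list contributes weight / (rank + c) to the doc's score
            scores[doc] = scores.get(doc, 0.0) + weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears near the top of both lists (here "b") outranks one that tops only a single list:

```python
fused = weighted_rrf([["a", "b", "c"], ["b", "d"]], [0.5, 0.5])
# fused[0] == "b"
```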
 
# Call hybrid_search() from a LangGraph node
# Graph state definition
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str
 
# --- Define Graph Nodes (retrieve, generate, etc.) ---
def retrieve(state: State) -> dict:
    retrieved_docs = hybrid_search(state["question"], 3)
    return {"context": retrieved_docs}
 

Note: the code above covers only the retrieval component; it still needs to be integrated into the overall application structure and RAG flow.

My question is the following: is there a better approach (simpler or cleaner code) for retrieving from millions of documents?
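On scale: in the snippet above, chroma.get() pulls every stored document and BM25Retriever.from_documents re-tokenizes and re-indexes all of them on every call, which is the first thing that breaks at millions of documents. To make visible what that rebuild recomputes, here is a minimal pure-Python Okapi BM25 scorer (an illustrative sketch, not LangChain's implementation, which wraps the rank_bm25 package):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc in `docs` against `query_terms` with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency of each term: this full pass over the corpus
    # is what gets redone whenever the BM25 index is rebuilt.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return scores
```

One direction worth considering, whatever library is used: build the BM25 index once at startup (or incrementally on ingest) and reuse it across queries, rather than reconstructing it inside hybrid_search().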
