Reputation: 177
For those who have integrated the ChromaDB client with the LangChain framework, I propose the following approach to implement hybrid search (vector search + BM25Retriever):
from langchain_chroma import Chroma
import chromadb
from chromadb.config import Settings
from langchain_openai import OpenAIEmbeddings
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_core.documents import Document
from langgraph.graph import START, StateGraph
from typing_extensions import TypedDict
from typing import List
# Assuming you have instantiated a Chroma client and integrated it into LangChain (example below)
"""
persistent_client = chromadb.PersistentClient(path="./test", settings=Settings(allow_reset=True))
collection = persistent_client.get_or_create_collection(
    name="example",
    metadata={
        "hnsw:space": "cosine",
        # you can add other HNSW parameters if you want
    },
)
chroma = Chroma(
    client=persistent_client,
    collection_name=collection.name,
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-large"),
)
"""
def hybrid_search(query: str, k: int = 5):
    """Perform a hybrid search (similarity_search + BM25Retriever) over the collection."""
    # Get all raw documents from ChromaDB
    raw_docs = chroma.get(include=["documents", "metadatas"])
    # Convert them into Document objects
    documents = [
        Document(page_content=doc, metadata=meta)
        for doc, meta in zip(raw_docs["documents"], raw_docs["metadatas"])
    ]
    # Create a BM25Retriever from the documents
    bm25_retriever = BM25Retriever.from_documents(documents=documents, k=k)
    # Create a vector-search retriever from the Chroma instance
    similarity_search_retriever = chroma.as_retriever(
        search_type="similarity",
        search_kwargs={"k": k},
    )
    # Ensemble the retrievers using LangChain's EnsembleRetriever
    ensemble_retriever = EnsembleRetriever(
        retrievers=[similarity_search_retriever, bm25_retriever],
        weights=[0.5, 0.5],
    )
    # Retrieve the top documents for the query
    return ensemble_retriever.invoke(query)  # use ainvoke(query) instead to retrieve the docs asynchronously
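For intuition, `EnsembleRetriever` merges the two ranked lists with weighted Reciprocal Rank Fusion (by default with a smoothing constant of roughly 60). A self-contained, library-free sketch of that fusion step (the document IDs here are made up for illustration):

```python
def weighted_rrf(ranked_lists, weights, c=60):
    """Weighted Reciprocal Rank Fusion: each document earns
    weight / (c + rank) per list, and scores are summed across lists."""
    scores = {}
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (c + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d1", "d2", "d3"]  # e.g. from the similarity retriever
bm25_hits = ["d3", "d1", "d4"]    # e.g. from the BM25 retriever
print(weighted_rrf([vector_hits, bm25_hits], weights=[0.5, 0.5]))
```

A document ranked highly by both retrievers (here `d1`) ends up first even if neither list puts it in the same position.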
# Example call: docs = hybrid_search("some query", k=5)
# Graph-nodes / state approach
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str

# --- Define graph nodes (retrieve, generate, etc.) ---
def retrieve(state: State) -> dict:
    retrieved_docs = hybrid_search(state["question"], k=3)
    return {"context": retrieved_docs}
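The node pattern above works by merging each node's returned partial dict into the running state. A library-free sketch of that flow, where the `generate` node and the `run` helper are hypothetical stand-ins for the rest of the RAG pipeline:

```python
from typing import List, TypedDict

class State(TypedDict, total=False):
    question: str
    context: List[str]
    answer: str

def retrieve(state: State) -> dict:
    # Stand-in for hybrid_search(); returns only the state fields it updates
    return {"context": [f"doc about {state['question']}"]}

def generate(state: State) -> dict:
    # Hypothetical generation step consuming the retrieved context
    return {"answer": f"Based on {len(state['context'])} docs: ..."}

def run(state: State, nodes) -> State:
    # LangGraph-style execution: merge each node's update into the state
    for node in nodes:
        state = {**state, **node(state)}
    return state

final = run({"question": "hybrid search"}, [retrieve, generate])
print(final["answer"])
```

In the real application, LangGraph's `StateGraph` plays the role of `run`, wiring the nodes and propagating the state updates for you.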
Note: the code above covers only the retrieval component; it still needs to be integrated into the application structure and the overall RAG flow.
My question is the following: is there a better approach (simpler or cleaner code) for retrieval over millions of documents?
Upvotes: 0
Views: 40