Yash Sharma

Reputation: 112

Searching existing ChromaDB database using cosine similarity

I have a preexisting database with around 15 PDFs stored. I want to search the database and get back the X most relevant results that meet a given cosine-similarity threshold.

Currently, I've defined a collection using this code:

import chromadb

# Open the on-disk database and fetch (or create) the collection
chroma_client = chromadb.PersistentClient(path="TEST_EMBEDDINGS/CHUNK_EMBEDDINGS")
collection = chroma_client.get_or_create_collection(name="CHUNK_EMBEDDINGS")

I've done a bit of research, and it seems that ChromaDB does not have a similarity search while FAISS does. However, the existing solutions online suggest something along these lines:

from langchain.vectorstores import Chroma
db = Chroma.from_documents(texts, embeddings)
docs_score = db.similarity_search_with_score(query=query, distance_metric="cos", k = 6)

I am unsure how to integrate this code, or whether there are better solutions.

Upvotes: 3

Views: 8298

Answers (2)

Mark

Reputation: 1169

Chroma's distance metrics take some getting used to. I started freaking out when I got values greater than one. Chroma's default distance is the squared L2 norm, so on a unit hypersphere (vectors normalized to unit length) you can get a distance as large as 4, which happens when two vectors point in opposite directions.

Cosine similarity, which for unit-length vectors is just the dot product, is recast by Chroma as cosine distance by subtracting it from one. So, where you would normally search for high similarity, you will want low distance.

[Image: Chroma distance metrics]

I should add that many popular embedding models return normalized vectors, in which case the denominator of the cosine expression (the product of the two vector norms) is just 1.
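As a quick sanity check of the numbers above, here is a small pure-Python sketch. The helper names (`normalize`, `squared_l2`, `cosine_distance`) are made up for illustration and are not part of Chroma; they just reimplement the two formulas by hand. For unit-length vectors, squared L2 works out to exactly twice the cosine distance, so the two metrics rank results identically.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance: sum of squared component differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """One minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

a = normalize([1.0, 2.0, 3.0])
b = normalize([-1.0, -2.0, -3.0])  # points in the opposite direction to a
c = normalize([3.0, 1.0, 0.5])     # some unrelated direction

# Opposite unit vectors give the largest possible distances
print("squared L2 (a, b):", squared_l2(a, b))
print("cosine distance (a, b):", cosine_distance(a, b))

# For unit vectors: squared L2 == 2 * cosine distance (up to rounding)
print("squared L2 (a, c):", squared_l2(a, c))
print("2 * cosine distance (a, c):", 2 * cosine_distance(a, c))
```

So if your collection was built with the default metric and your embeddings are unit length, you should be able to recover cosine similarity from a returned distance `d` as `1 - d / 2`.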

Upvotes: 0

Chris Mungall

Reputation: 782

ChromaDB does have similarity search. The default distance is squared L2, but you can change it when you create the collection, as documented here.

collection = client.create_collection(
    name="collection_name",
    metadata={"hnsw:space": "cosine"} # l2 is the default
)
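For the "top X results above a threshold" part of the question, something like the following might work. It is a rough sketch, not an official recipe: `filter_by_similarity`, `search`, and `min_similarity` are names I made up, and it assumes the collection was created with "hnsw:space": "cosine", so that each returned distance is one minus the cosine similarity. It also assumes the collection can embed `query_texts` with the same model used for the stored chunks; if you stored your own embeddings, pass `query_embeddings` to `query` instead.

```python
def filter_by_similarity(results, min_similarity):
    """Keep hits whose cosine similarity (1 - cosine distance) meets the threshold.

    `results` is the dict returned by collection.query(); each field holds
    one inner list per query, so index 0 is the first (only) query here.
    """
    hits = []
    for doc_id, doc, dist in zip(
        results["ids"][0], results["documents"][0], results["distances"][0]
    ):
        similarity = 1.0 - dist  # assumes the collection uses cosine distance
        if similarity >= min_similarity:
            hits.append((doc_id, doc, similarity))
    return hits

def search(collection, query_text, k=6, min_similarity=0.5):
    """Fetch the k nearest chunks, then drop those below the threshold."""
    results = collection.query(
        query_texts=[query_text],
        n_results=k,
        include=["documents", "distances"],
    )
    return filter_by_similarity(results, min_similarity)
```

You would call it as `search(collection, "your question here", k=6, min_similarity=0.7)`. As far as I know, the distance metric is fixed when a collection is created, so an existing collection built with the default metric would need to be recreated (or its distances converted) before this threshold means what you expect.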

Upvotes: 3
