Reputation: 112
I have an existing database with around 15 PDFs stored in it. I want to search it and get back the X most relevant results whose cosine similarity to the query passes a given threshold.
Currently, I've defined a collection using this code:
chroma_client = chromadb.PersistentClient(path="TEST_EMBEDDINGS/CHUNK_EMBEDDINGS")
collection = chroma_client.get_or_create_collection(name="CHUNK_EMBEDDINGS")
I've done a bit of research, and it seems to me that ChromaDB does not have a similarity search while FAISS does. However, the existing solutions online suggest something along these lines:
from langchain.vectorstores import Chroma
db = Chroma.from_documents(texts, embeddings)
docs_score = db.similarity_search_with_score(query=query, distance_metric="cos", k = 6)
I am unsure how I can integrate this code or if there are better solutions.
Upvotes: 3
Views: 8298
Reputation: 1169
Chroma uses some funky distance metrics. I started freaking out when I got values greater than one. Chroma's default distance is the squared L2 norm, so on a unit hypersphere (vectors normalized to unit length) the distance can be as large as 4.
Cosine similarity, which for unit vectors is just the dot product, is recast by Chroma as cosine distance by subtracting it from one. So where you would normally search for high similarity, you want low distance instead.
I should add that most popular embedding models return normalized vectors, so the denominator in the cosine similarity formula is just 1.
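To make the numbers above concrete, here is a small standard-library sketch of the two distances. The helper names (normalize, squared_l2, cosine_distance) and the example vectors are mine, not part of Chroma's API; they only illustrate the arithmetic, assuming unit-length vectors.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance (the default metric described above)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Cosine distance: one minus cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical example vectors, normalized to unit length.
a = normalize([3.0, 4.0])
b = normalize([-3.0, -4.0])  # points in the opposite direction to a
c = normalize([4.0, 3.0])    # points in nearly the same direction as a

# Opposite unit vectors give the largest possible values.
print(squared_l2(a, b))       # approximately 4
print(cosine_distance(a, b))  # approximately 2

# For unit vectors, squared L2 equals twice the cosine distance.
print(squared_l2(a, c), 2 * cosine_distance(a, c))
```

Because the two metrics differ only by a factor of two on unit vectors, they rank results in the same order; only the threshold you pick changes.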
Upvotes: 0
Reputation: 782
ChromaDB does have similarity search. The default distance is squared L2, but you can change it when you create the collection, as described in the Chroma documentation:
collection = client.create_collection(
    name="collection_name",
    metadata={"hnsw:space": "cosine"}  # l2 is the default
)
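For the threshold part of the question, one option is to filter the query results yourself. The sketch below assumes the result has the shape returned by collection.query(): a dict whose "ids", "documents" and "distances" values hold one inner list per query text. The helper filter_by_distance, the sample data and the 0.35 cutoff are all made up for illustration.

```python
def filter_by_distance(results, max_distance):
    """Keep hits for the first query whose distance is at most max_distance.

    `results` is assumed to look like the dict returned by
    collection.query(): each value holds one inner list per query text.
    """
    ids = results["ids"][0]
    docs = results["documents"][0]
    dists = results["distances"][0]
    return [
        (doc_id, doc, dist)
        for doc_id, doc, dist in zip(ids, docs, dists)
        if dist <= max_distance
    ]

# Hand-written stand-in for a query result, so the sketch runs
# without a database. The ids, texts and distances are invented.
sample = {
    "ids": [["c1", "c2", "c3"]],
    "documents": [["chunk one", "chunk two", "chunk three"]],
    "distances": [[0.12, 0.31, 0.74]],
}

hits = filter_by_distance(sample, max_distance=0.35)
for doc_id, doc, dist in hits:
    print(doc_id, doc, dist)
```

With a real collection you would pass in something like collection.query(query_texts=[query], n_results=6) instead of the sample dict. In a cosine collection the distance is one minus the similarity, so a maximum distance of 0.35 corresponds to a minimum similarity of 0.65.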
Upvotes: 3