Reputation: 171
The Hugging Face datasets library has a pretty awesome feature: it can create a FAISS index on an embeddings dataset, which allows searching for the nearest neighbors.
train_ds['train'].add_faiss_index("embedding")
scores, sample = train_ds['train'].get_nearest_examples("embedding", query_embedding, k=10)
I'm trying to understand the significance of the scores and the intuition behind them. For example, here is how cosine similarity relates to the FAISS search score:
import numpy as np
from scipy.spatial.distance import cosine

print("cosine sim", "faiss score")
for i in range(10):
    distance = cosine(query_embedding, sample["embedding"][i])
    print(np.round(1 - distance, 3), scores[i])
we get this:
cosine sim faiss score
0.9983 75.67109
0.9961 149.42262
0.9969 169.43077
0.9939 243.45598
0.9914 284.8134
0.9963 309.41052
0.9934 327.74158
0.9898 330.72858
0.9897 337.12408
0.99 337.60126
As you can see, the cosine similarity is pretty much uniform and very close to 1. However, the FAISS search scores vary widely. I'm trying to understand what these numbers represent and how they are calculated. Can they be used to return only search results above some threshold? Unfortunately, the documentation doesn't cover this.
Upvotes: 3
Views: 3835
Reputation: 1
I had the same question, so I ran a test, which confirmed that the function uses Euclidean distance as the scoring metric. The lower the score, the higher the similarity.
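You can verify this yourself by recomputing the distances with NumPy and comparing them to the returned scores. This is just a sketch that reuses query_embedding, sample, and scores from the question; note that FAISS's flat L2 index typically reports squared Euclidean distances, so check both columns against the score:

import numpy as np

query = np.asarray(query_embedding, dtype=np.float32)
for i in range(10):
    neighbor = np.asarray(sample["embedding"][i], dtype=np.float32)
    l2 = np.linalg.norm(query - neighbor)               # Euclidean distance
    sq_l2 = float(np.sum((query - neighbor) ** 2))      # squared Euclidean distance
    print(scores[i], l2, sq_l2)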
Upvotes: -1
Reputation: 3141
I sent your question to ChatGPT and it confirmed that the FAISS score represents how near a result is to the query. Its answer was:
The scores returned by the FAISS index represent the distances between the query embedding and the nearest embeddings in the dataset. The higher the score, the closer the corresponding example is to the query embedding. (ref: huggingface.co on Semantic search with FAISS)
FAISS uses an algorithm to efficiently compute the distances between vectors and organize them in a way that allows for fast nearest neighbor search. The algorithm uses a combination of quantization and indexing techniques to divide the vector space into smaller subspaces, which makes the search faster and more efficient.
In contrast, the cosine similarity measure computes the cosine of the angle between two vectors, which represents how similar they are in direction. Cosine similarity is a commonly used metric in natural language processing (NLP) and information retrieval to compare the similarity of text documents, as well as in other areas such as image processing and recommendation systems.
In your example, the cosine similarities and FAISS scores are not directly comparable, as they measure different aspects of the similarity between vectors. However, they can be used together to provide a more comprehensive understanding of the nearest neighbor search results.
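If you want the returned score to line up with cosine similarity directly, one option (a sketch, not something taken from the Hugging Face docs) is to L2-normalize the stored embeddings and the query, then build the index with the inner-product metric, since the inner product of unit vectors equals their cosine similarity:

import faiss
import numpy as np

def normalize(example):
    # scale each stored embedding to unit length so that inner product == cosine similarity
    emb = np.asarray(example["embedding"], dtype=np.float32)
    example["embedding"] = emb / np.linalg.norm(emb)
    return example

ds = train_ds['train'].map(normalize)
ds.add_faiss_index("embedding", metric_type=faiss.METRIC_INNER_PRODUCT)

query = np.asarray(query_embedding, dtype=np.float32)
query = query / np.linalg.norm(query)
scores, sample = ds.get_nearest_examples("embedding", query, k=10)
# scores are now cosine similarities, so higher means more similar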
Upvotes: -2
Reputation: 973
FAISS uses binning and PQ (Product Quantization) to yield approximate answers quickly while requiring considerably less memory. So the score might bounce around because of this approximation. It's not even guaranteed to find all k nearest neighbors, because of the approximation (due to sampling only some of the bins, I think).
So yes, you can use a cutoff if you want, realizing that the clever shortcuts FAISS takes won't ever yield something exactly equivalent to cosine similarity. But cosine similarity alone cannot do the tasks that FAISS can do.
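A cutoff could look something like the sketch below. The threshold value is purely illustrative and depends on your embeddings; with the default distance-style scores, lower means more similar, so you keep results below it:

THRESHOLD = 200.0  # hypothetical value, tune for your data and metric

scores, sample = train_ds['train'].get_nearest_examples("embedding", query_embedding, k=10)
keep = [i for i, s in enumerate(scores) if s < THRESHOLD]
# sample is a dict of column -> list, so filter each column by the kept indices
filtered = {key: [values[i] for i in keep] for key, values in sample.items()}
print(len(keep), "results under the threshold")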
Upvotes: 0