Reputation: 171
The Hugging Face datasets library has a pretty awesome feature: it can create a FAISS index on an embeddings dataset, which allows searching for the nearest neighbors.
train_ds['train'].add_faiss_index("embedding")
scores, sample = train_ds['train'].get_nearest_examples("embedding", query_embedding, k=10)
I'm trying to understand the significance of the scores and the intuition behind them. For example, here is how cosine similarity relates to the FAISS search score:
import numpy as np
from scipy.spatial.distance import cosine

print("cosine sim", "faiss score")
for i in range(10):
    distance = cosine(query_embedding, sample["embedding"][i])
    print(np.round(1 - distance, 3), scores[i])
we get this:
cosine sim faiss score
0.9983 75.67109
0.9961 149.42262
0.9969 169.43077
0.9939 243.45598
0.9914 284.8134
0.9963 309.41052
0.9934 327.74158
0.9898 330.72858
0.9897 337.12408
0.99 337.60126
As you can see, the cosine similarity is pretty much uniform and very close to 1. However, the FAISS search scores vary widely. I'm trying to understand what these numbers represent and how they are calculated. Can they be used to return only search results above some threshold? Unfortunately, the documentation doesn't cover this.
Upvotes: 3
Views: 3835
Reputation: 1
I had the same question, so I ran a test, which confirmed that the function uses Euclidean distance as the scoring metric. The lower the score, the higher the similarity.
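You can verify this yourself by recomputing the distances with NumPy and comparing them to the returned scores. This is just a sketch that reuses query_embedding, sample, and scores from the question; note that FAISS's flat L2 index typically reports squared Euclidean distances, so check both columns against the score:

import numpy as np

query = np.asarray(query_embedding, dtype=np.float32)
for i in range(10):
    neighbor = np.asarray(sample["embedding"][i], dtype=np.float32)
    l2 = np.linalg.norm(query - neighbor)               # Euclidean distance
    sq_l2 = float(np.sum((query - neighbor) ** 2))      # squared Euclidean distance
    print(scores[i], l2, sq_l2)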
Upvotes: -1
Reputation: 3141
I sent your question to ChatGPT and it confirmed that the FAISS score represents how near a result is to the query. Its answer was:
The scores returned by the FAISS index represent the distances between the query embedding and the nearest embeddings in the dataset. The higher the score, the closer the corresponding example is to the query embedding. (ref: huggingface.co on Semantic search with FAISS)
FAISS uses an algorithm to efficiently compute the distances between vectors and organize them in a way that allows for fast nearest neighbor search. The algorithm uses a combination of quantization and indexing techniques to divide the vector space into smaller subspaces, which makes the search faster and more efficient.
In contrast, the cosine similarity measure computes the cosine of the angle between two vectors, which represents how similar they are in direction. Cosine similarity is a commonly used metric in natural language processing (NLP) and information retrieval to compare the similarity of text documents, as well as in other areas such as image processing and recommendation systems.
In your example, the cosine similarities and FAISS scores are not directly comparable, as they measure different aspects of the similarity between vectors. However, they can be used together to provide a more comprehensive understanding of the nearest neighbor search results.
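If you want the returned score to line up with cosine similarity directly, one option (a sketch, not something taken from the Hugging Face docs) is to L2-normalize the stored embeddings and the query, then build the index with the inner-product metric, since the inner product of unit vectors equals their cosine similarity:

import faiss
import numpy as np

def normalize(example):
    # scale each stored embedding to unit length so that inner product == cosine similarity
    emb = np.asarray(example["embedding"], dtype=np.float32)
    example["embedding"] = emb / np.linalg.norm(emb)
    return example

ds = train_ds['train'].map(normalize)
ds.add_faiss_index("embedding", metric_type=faiss.METRIC_INNER_PRODUCT)

query = np.asarray(query_embedding, dtype=np.float32)
query = query / np.linalg.norm(query)
scores, sample = ds.get_nearest_examples("embedding", query, k=10)
# scores are now cosine similarities, so higher means more similar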
Upvotes: -2
Reputation: 973
FAISS uses binning and PQ (Product Quantization) to yield approximate answers quickly while requiring considerably less memory. So the score might bounce around because of this approximation. It's not even guaranteed to find all k nearest neighbors, because of the approximation (due to sampling only some of the bins, I think).
So yes, you can use a cutoff if you want, realizing that the clever shortcuts FAISS takes won't ever yield something exactly equivalent to cosine similarity. But cosine similarity alone cannot do the tasks that FAISS can do.
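A cutoff could look something like the sketch below. The threshold value is purely illustrative and depends on your embeddings; with the default distance-style scores, lower means more similar, so you keep results below it:

THRESHOLD = 200.0  # hypothetical value, tune for your data and metric

scores, sample = train_ds['train'].get_nearest_examples("embedding", query_embedding, k=10)
keep = [i for i, s in enumerate(scores) if s < THRESHOLD]
# sample is a dict of column -> list, so filter each column by the kept indices
filtered = {key: [values[i] for i in keep] for key, values in sample.items()}
print(len(keep), "results under the threshold")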
Upvotes: 0