Reputation: 3369
I am trying to do semantic search with sentence-transformers and faiss.
I am able to generate embeddings from a corpus and run a search with the query xq.
But what are t
import faiss
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("flax-sentence-embeddings/st-codesearch-distilroberta-base")

def get_embeddings(code_snippets: str):
    return model.encode(code_snippets)

def build_vector_database(atlas_datapoints):
    dimension = 768  # dimensions of each vector
    corpus = ["tom loves candy",
              "this is a test"
              "hello world"
              "jerry loves programming"]
    code_snippet_emddings = get_embeddings(corpus)
    print(code_snippet_emddings.shape)
    d = code_snippet_emddings.shape[1]
    index = faiss.IndexFlatL2(d)
    print(index.is_trained)
    index.add(code_snippet_emddings)
    print(index.ntotal)
    k = 2
    xq = model.encode(["jerry loves candy"])
    D, I = index.search(xq, k)  # search
    print(I)
    print(D)
This code returns
[[0 1]]
[[1.3480902 1.6274161]]
But this only gives me indices and matching scores; I can't tell which sentences xq is matching.
How can I find the top-N matching strings from the corpus?
Upvotes: 2
Views: 566
Reputation: 590
You forgot some commas in your corpus. Python concatenates adjacent string literals, so you only passed two sentences to the index.
As for the result, your ids are sequential because IndexFlatL2 does not support add_with_ids (unless you wrap it in an IndexIDMap), as stated in the documentation.
The [[0 1]] are the positions of the matches in the corpus array, and [[1.3480902 1.6274161]] are their respective distances. If you expected 4 entries in the index rather than 2, that is because of the missing commas.
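To see the effect of the missing commas in isolation, here is a minimal sketch in plain Python. The names broken and fixed are just illustrative; the strings are the ones from the question.

```python
# Adjacent string literals with no comma between them are joined into one string.
broken = ["tom loves candy",
          "this is a test"
          "hello world"
          "jerry loves programming"]

# With a comma after every element, each sentence is its own list item.
fixed = ["tom loves candy",
         "this is a test",
         "hello world",
         "jerry loves programming"]

print(len(broken))  # 2
print(broken[1])    # this is a testhello worldjerry loves programming
print(len(fixed))   # 4
```

With the fixed list, index.ntotal should report 4 after index.add, and the returned ids 0 to 3 line up with the four sentences.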
Upvotes: 0
Reputation: 122052
To retrieve the query results, try something like this using the variables from your code.
[corpus[i] for i in I[0]]
But if you have the corpus as an np.array object, you can do some cool indexing like this:
import numpy as np
# If your corpus is in array form.
corpus = np.array(['abc def', 'foo bar', 'bar bar sheep'])
# And the indices can be a list of integers.
indices = [1, 0]
# Results.
corpus[indices]
And it can get a little cooler if your indices are already an np.array, like the output of faiss, for example if you have 2 queries and therefore 2 x k results:
import numpy as np
corpus = np.array(['abc def', 'foo bar', 'bar bar sheep'])
indices = np.array([[1,0], [0,2]])
corpus[indices]
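If you also want each sentence paired with its distance, something along these lines should work. This is only a sketch: top_matches is a hypothetical helper, and the D values below are invented for illustration rather than real search output.

```python
import numpy as np

corpus = np.array(['abc def', 'foo bar', 'bar bar sheep'])

# Hypothetical search output for 2 queries with k=2 (values invented for illustration).
I = np.array([[1, 0], [0, 2]])
D = np.array([[0.10, 0.75], [0.20, 0.90]], dtype=np.float32)

def top_matches(corpus, D, I):
    """Return, for each query, a list of (sentence, distance) pairs."""
    results = []
    for dist_row, idx_row in zip(D, I):
        # faiss fills missing results with -1, so skip negative indices.
        row = [(str(corpus[i]), float(d))
               for d, i in zip(dist_row, idx_row) if i >= 0]
        results.append(row)
    return results

matches = top_matches(corpus, D, I)
for query_number, row in enumerate(matches):
    print(query_number, row)
```

Each inner list is ordered the same way as the search output, so the first pair is the closest match for that query.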
The faiss.IndexFlatL2 object returns two arrays from its search() function:
D in your code snippet holds the distances of the top-K results from your query string.
I in your code snippet holds the indices of the top-K results.
Since you have only 1 query, n = 1, so your I and D matrices are both of size 1 x k.
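To make the shapes concrete, here is a rough NumPy stand-in for what a flat L2 search computes. It is not faiss code: flat_l2_search is a hypothetical helper, the vectors are random, and it assumes the distances are squared Euclidean distances, which is what the faiss documentation describes for IndexFlatL2.

```python
import numpy as np

def flat_l2_search(xb, xq, k):
    """Brute-force k-nearest-neighbour search by squared L2 distance."""
    # Pairwise squared distances, shape (n_queries, n_database).
    diff = xq[:, None, :] - xb[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    # Indices of the k smallest distances per query, nearest first.
    I = np.argsort(dist, axis=1)[:, :k]
    # Pick out the matching distances in the same order.
    D = np.take_along_axis(dist, I, axis=1)
    return D, I

rng = np.random.default_rng(0)
xb = rng.random((4, 8), dtype=np.float32)  # 4 database vectors, d = 8
xq = xb[[2]] + 0.01                        # 1 query, very close to vector 2

D, I = flat_l2_search(xb, xq, k=2)
print(D.shape, I.shape)  # (1, 2) (1, 2)
print(I[0, 0])           # 2
```

With n queries the two arrays have shape (n, k), so I[0] and D[0] are the results for the first (here, the only) query.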
Upvotes: 1