Sarthak Patel

Reputation: 11

Semantic-search in large documents

I am working on a project where I need to develop a program that identifies sentences from a predefined list, within a large document. The goal is to find the closest matches based on semantic meaning, as the sentences may not be exactly identical. For instance, "What is your name" should match with "Could you please tell me your name" since they convey the same meaning.

Currently, I am employing a sentence transformer to convert each line to embeddings and utilizing util.semantic_search to compare these embeddings against the embeddings of the 40 target sentences. Here's a snippet of my code:

for line, target in zip(self.target_sentences, self.target_embeddings):
    hits = util.semantic_search(target, line_embeddings, top_k=1)[0][0]
    if round(hits['score'], 2) >= 0.85:
        print(line)
        score += 1

Here, target_embeddings holds the embeddings of those 40 sentences and line_embeddings holds the embeddings of the entire document. I am using the model "multi-qa-MiniLM-L6-dot-v1" with SentenceTransformer.

This works, but it feels relatively slow, and I am not sure it is the most efficient way to address this problem. I am looking for advice on how to optimize the process and whether there are better approaches or technologies that could make this search faster and more accurate.

Upvotes: 0

Views: 1191

Answers (1)

Deepak Kumar

Reputation: 523

You can explore Approximate Nearest Neighbor (ANN) methods for faster search. FAISS and Annoy are popular libraries for ANN.

The SBERT function sentence_transformers.util.semantic_search answers a query by computing the similarity of the query embedding against every embedding in the search corpus. For example, if your search corpus has 10 million records, a single search computes 10 million similarity (cosine) scores and returns the top-scoring one.

You can slightly improve the search time by properly configuring the parameters (query_chunk_size, corpus_chunk_size) that control how the similarity calculation is batched; see the sketch below.
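
For instance, the chunk sizes can be passed directly to util.semantic_search, and all 40 target embeddings can be sent in one call instead of a Python loop, so the batching happens inside the function. A minimal sketch, reusing the variable names from the question (the chunk sizes here are illustrative, not tuned):

from sentence_transformers import util

# One call for all target embeddings; semantic_search returns a list of hit
# lists, one per query embedding, each sorted by score (best first).
all_hits = util.semantic_search(
    target_embeddings,           # the 40 query embeddings at once
    line_embeddings,             # embeddings of the whole document (search corpus)
    query_chunk_size=100,        # how many queries are processed per batch
    corpus_chunk_size=100000,    # how many corpus embeddings are compared per batch
    top_k=1,
)
for sentence, hits in zip(target_sentences, all_hits):
    if round(hits[0]['score'], 2) >= 0.85:
        print(sentence)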

If your corpus is large, I would suggest using an ANN technique to partition the corpus embeddings into smaller groups of similar embeddings. The embeddings with the highest similarity (the nearest neighbors) can then be retrieved within milliseconds, even if you have millions of records in the corpus; see the Annoy sketch below.
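
Annoy, mentioned above, is one of the lighter-weight options. A minimal sketch of building and querying an Annoy index, again reusing the question's variable names (the embedding dimension and number of trees are illustrative; 'angular' roughly corresponds to cosine distance):

from annoy import AnnoyIndex

embedding_dim = 384                                 # dimension of your sentence embeddings
annoy_index = AnnoyIndex(embedding_dim, 'angular')  # 'angular' ~ cosine distance

# Add every document embedding with its position as the item id
for i, emb in enumerate(line_embeddings):
    annoy_index.add_item(i, emb)

annoy_index.build(10)   # number of trees; more trees = better recall, slower build

# Nearest neighbour for one target embedding (ids index back into the document)
ids, distances = annoy_index.get_nns_by_vector(target_embeddings[0], 1, include_distances=True)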

For a sample implementation of FAISS indexing, you can refer to https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/semantic-search/semantic_search_quora_faiss.py

Code snippet adapted from this reference source file:

import faiss
import numpy as np

### Create the FAISS index
embedding_size = 384   # dimension of the sentence embeddings (depends on your model)
n_clusters = 1024      # number of IVF partitions (tune to your corpus size)
quantizer = faiss.IndexFlatIP(embedding_size)  # coarse quantizer over inner product
index = faiss.IndexIVFFlat(quantizer, embedding_size, n_clusters, faiss.METRIC_INNER_PRODUCT)
# First, we need to normalize vectors to unit length so inner product equals cosine similarity
corpus_embeddings = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1)[:, None]
# Then we train the index to find a suitable clustering
index.train(corpus_embeddings)
# Finally we add all embeddings to the index
index.add(corpus_embeddings)
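
Querying the built index follows the same pattern as in the reference example: normalize the query embedding the same way, set nprobe (how many clusters are scanned; higher is more accurate but slower), and call search. A minimal sketch with illustrative values:

# Search the index for the single best match of one query embedding
index.nprobe = 10
query_embedding = target_embeddings[0] / np.linalg.norm(target_embeddings[0])
distances, corpus_ids = index.search(query_embedding.reshape(1, -1).astype('float32'), 1)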

Upvotes: 0
