Sarthak Patel

Reputation: 11

Semantic-search in large documents

I am working on a project where I need to develop a program that identifies sentences from a predefined list, within a large document. The goal is to find the closest matches based on semantic meaning, as the sentences may not be exactly identical. For instance, "What is your name" should match with "Could you please tell me your name" since they convey the same meaning.

Currently, I am employing a sentence transformer to convert each line to embeddings and utilizing util.semantic_search to compare these embeddings against the embeddings of the 40 target sentences. Here's a snippet of my code:

for line, target in zip(self.target_sentences, self.target_embeddings):
    hits = util.semantic_search(target, line_embeddings, top_k=1)[0][0]
    if round(hits['score'], 2) >= 0.85:
        print(line)
        score += 1

Here, target_embeddings holds the embeddings of those 40 sentences and line_embeddings holds the embeddings of the entire document. I am using the model "multi-qa-MiniLM-L6-dot-v1" with SentenceTransformer.

This works, but it feels relatively slow, and I am not sure it is the most efficient way to address this problem. I am looking for advice on how to optimize the process and whether there are better approaches or technologies that could make this search faster and more accurate.

Upvotes: 0

Views: 1191

Answers (1)

Deepak Kumar

Reputation: 523

You can explore Approximate Nearest Neighbor (ANN) methods for faster search. FAISS and Annoy are popular libraries for ANN.

The SBERT function sentence_transformers.util.semantic_search answers a query by computing the similarity of the query embedding against every embedding in the search corpus. For example, if your search corpus has 10 million records, a single search computes 10 million similarity (cosine) scores and returns the top-scoring one.

You can slightly improve the search time by properly configuring the parameters (query_chunk_size, corpus_chunk_size) that control how the similarity calculation is batched; see the sketch below.
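
For instance, the chunk sizes can be passed directly to util.semantic_search, and all 40 target embeddings can be sent in one call instead of a Python loop, so the batching happens inside the function. A minimal sketch, reusing the variable names from the question (the chunk sizes here are illustrative, not tuned):

from sentence_transformers import util

# One call for all target embeddings; semantic_search returns a list of hit
# lists, one per query embedding, each sorted by score (best first).
all_hits = util.semantic_search(
    target_embeddings,           # the 40 query embeddings at once
    line_embeddings,             # embeddings of the whole document (search corpus)
    query_chunk_size=100,        # how many queries are processed per batch
    corpus_chunk_size=100000,    # how many corpus embeddings are compared per batch
    top_k=1,
)
for sentence, hits in zip(target_sentences, all_hits):
    if round(hits[0]['score'], 2) >= 0.85:
        print(sentence)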

If your corpus is large, I would suggest using an ANN technique to partition the corpus embeddings into smaller groups of similar embeddings. The embeddings with the highest similarity (the nearest neighbors) can then be retrieved within milliseconds, even if you have millions of records in the corpus; see the Annoy sketch below.
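
Annoy, mentioned above, is one of the lighter-weight options. A minimal sketch of building and querying an Annoy index, again reusing the question's variable names (the embedding dimension and number of trees are illustrative; 'angular' roughly corresponds to cosine distance):

from annoy import AnnoyIndex

embedding_dim = 384                                 # dimension of your sentence embeddings
annoy_index = AnnoyIndex(embedding_dim, 'angular')  # 'angular' ~ cosine distance

# Add every document embedding with its position as the item id
for i, emb in enumerate(line_embeddings):
    annoy_index.add_item(i, emb)

annoy_index.build(10)   # number of trees; more trees = better recall, slower build

# Nearest neighbour for one target embedding (ids index back into the document)
ids, distances = annoy_index.get_nns_by_vector(target_embeddings[0], 1, include_distances=True)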

For a sample implementation of FAISS indexing, you can refer to https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/semantic-search/semantic_search_quora_faiss.py

Code snippet adapted from this reference source file:

import faiss
import numpy as np

### Create the FAISS index
embedding_size = 384   # dimension of the sentence embeddings (depends on your model)
n_clusters = 1024      # number of IVF partitions (tune to your corpus size)
quantizer = faiss.IndexFlatIP(embedding_size)  # coarse quantizer over inner product
index = faiss.IndexIVFFlat(quantizer, embedding_size, n_clusters, faiss.METRIC_INNER_PRODUCT)
# First, we need to normalize vectors to unit length so inner product equals cosine similarity
corpus_embeddings = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1)[:, None]
# Then we train the index to find a suitable clustering
index.train(corpus_embeddings)
# Finally we add all embeddings to the index
index.add(corpus_embeddings)
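
Querying the built index follows the same pattern as in the reference example: normalize the query embedding the same way, set nprobe (how many clusters are scanned; higher is more accurate but slower), and call search. A minimal sketch with illustrative values:

# Search the index for the single best match of one query embedding
index.nprobe = 10
query_embedding = target_embeddings[0] / np.linalg.norm(target_embeddings[0])
distances, corpus_ids = index.search(query_embedding.reshape(1, -1).astype('float32'), 1)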

Upvotes: 0
