I have 2 lists of words. The first list contains 5 words. The second list contains 1000s of words.
I am looking for an ML model that will help me find the best match between the words in the first list and the words in the second list, by assigning a score to every possible pair from the first list to the second list. The highest score means the best match, and a score of 1.0 means a perfect match.
For example, list A has the word Light Classic. List B has Classical Music, Rock and Opera.

Light Classic and Classical Music is 0.82
Light Classic and Rock is 0.23
Light Classic and Opera is 0.54

Therefore, the best match for Light Classic is Classical Music.
Currently I am using sentence_transformers with the all-mpnet-base-v2 model to find these scores, using the cosine similarity as the score. The code I am using is shown below:
from tqdm import tqdm
from sentence_transformers import SentenceTransformer, util

model_name = 'all-mpnet-base-v2'
model = SentenceTransformer(model_name)

mapping = {}
for word_a in list_a:
    word_a = word_a.lower()
    # best match found so far for this word
    mapping[word_a] = {
        'score': 0,
        'list_b': ''
    }
    embedding_word_a = model.encode(word_a, convert_to_tensor=True)
    for word_b in list_b:
        word_b = word_b.lower()
        if word_a == word_b:
            # exact string match: perfect score, stop searching
            mapping[word_a]['score'] = 1.0
            mapping[word_a]['list_b'] = word_b
            break
        embedding_word_b = model.encode(word_b, convert_to_tensor=True)
        cosine_score = round(util.cos_sim(embedding_word_a, embedding_word_b).item(), 2)
        if cosine_score > mapping[word_a]['score']:
            mapping[word_a]['score'] = cosine_score
            mapping[word_a]['list_b'] = word_b

print(mapping)
While this works fine, I have two questions:

1. Is there a better model I can use to get these scores?
2. Is there a way to avoid the double for-loop, since list B can contain thousands of words?
I am on Python v3.9.16.
I don't know whether you can use a better model, other than trying other BERT-style models such as bert-base-uncased, but there is a way to avoid the double for-loop.
Check out the documentation for util.cos_sim. It says you can give it two tensors, and it will return a matrix of the cosine similarities between them.
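For instance, a minimal sketch of that behaviour (the shapes below just mirror the sizes in the question):

import torch
from sentence_transformers import util

# two batches of embeddings: 5 rows for list_a, 1000 rows for list_b
a = torch.randn(5, 768)
b = torch.randn(1000, 768)

sims = util.cos_sim(a, b)
print(sims.shape)  # torch.Size([5, 1000]) -- one similarity per (a, b) pair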
So, if list_b is not extraordinarily large, instead of using a nested for-loop you can save the embeddings of each list in its own torch tensor:
import torch

# all-mpnet-base-v2 produces 768-dimensional embeddings
embs_a = torch.zeros(len(list_a), 768)  # one row per word in list_a (5 words)
embs_b = torch.zeros(len(list_b), 768)  # one row per word in list_b
You can then loop over each list individually and assign the embeddings to their respective tensors like so:
for i, word_a in enumerate(list_a):
    embedding_word_a = model.encode(word_a, convert_to_tensor=True)
    embs_a[i] = embedding_word_a

for i, word_b in enumerate(list_b):
    embedding_word_b = model.encode(word_b, convert_to_tensor=True)
    embs_b[i] = embedding_word_b
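As a side note (not part of the original answer, but standard sentence_transformers behaviour): model.encode also accepts a list of strings, so both loops can be replaced by a single call per list:

# encode each list in one call; with convert_to_tensor=True, model.encode
# returns an (n, dim) tensor holding one embedding per input string
embs_a = model.encode(list_a, convert_to_tensor=True)
embs_b = model.encode(list_b, convert_to_tensor=True)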
Finally, you can get the matrix of cosine similarities like so (I removed the rounding):
cos_sims = util.cos_sim(embs_a, embs_b)
Here, the rows are words from list_a, and the columns are words from list_b. So, cos_sims[0, 1] will be the cosine similarity between the first word from list_a and the second word from list_b.
Then, for each word in list_a, you can get the best matching word from list_b as follows:
# for each row (word in list_a), take the maximum similarity over list_b
scores, indices = torch.max(cos_sims, dim=-1)

mapping = dict()
for i, idx in enumerate(indices):
    mapping[list_a[i]] = {"list_b": list_b[idx.item()], "score": scores[i].item()}
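For very large lists, sentence_transformers also ships util.semantic_search, which chunks the similarity computation and directly returns the top-k matches per query. A rough sketch (not part of the original answer) of building the same mapping with it:

# semantic_search returns, for each query embedding, a list of the top_k
# corpus hits as dicts with 'corpus_id' and 'score' keys
hits = util.semantic_search(embs_a, embs_b, top_k=1)

mapping = {}
for i, hit in enumerate(hits):
    best = hit[0]  # the single best match for list_a[i]
    mapping[list_a[i]] = {"list_b": list_b[best["corpus_id"]], "score": best["score"]}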
As a final note: if you're working with tensors/numpy arrays and you're using a function specifically designed to process them, chances are this function is implemented efficiently in such a way that you can avoid for-loops.