Sharhad

Reputation: 83

A machine learning model to find similarities between two words in Python

I have 2 lists of words. The first list contains 5 words, and the second list contains thousands of words.

I am looking for an ML model that will help me find the best match for each word in the first list among the words in the second list, by assigning a score to every possible pair between the two lists. The highest score means the best match, and a score of 1.0 means a perfect match.

For example, list A has the word Light Classic, and list B has Classical Music, Rock, and Opera.

  1. Score between Light Classic and Classical Music is 0.82
  2. Score between Light Classic and Rock is 0.23
  3. Score between Light Classic and Opera is 0.54

Therefore, the best match for Light Classic is Classical Music.
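
In code, the mapping I want would look roughly like this (using the illustrative scores above):

# desired output: each word in list A mapped to its best match in list B, with the score
mapping = {
    'light classic': {'score': 0.82, 'list_b': 'classical music'}
}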

This image shows more examples: Mapping between List A and List B

Currently I am using sentence_transformers with the all-mpnet-base-v2 model to find these scores, and I am using the cosine similarity as the score. The code I am using is shown below:

from tqdm import tqdm
from sentence_transformers import SentenceTransformer, util

model_name = 'all-mpnet-base-v2'
model = SentenceTransformer(model_name)

mapping = {}
for word_a in list_a:
    word_a = word_a.lower()
    mapping[word_a] = {
        'score': 0,
        'list_b': ''
    }
    embedding_word_a = model.encode(word_a, convert_to_tensor=True)

    for word_b in list_b:
        word_b = word_b.lower()

        # exact match: perfect score, no need to check the rest of list_b
        if word_a == word_b:
            mapping[word_a]['score'] = 1.0
            mapping[word_a]['list_b'] = word_b
            break

        embedding_word_b = model.encode(word_b, convert_to_tensor=True)
        cosine_score = round(util.cos_sim(embedding_word_a, embedding_word_b).item(), 2)
        # keep the best-scoring word from list_b seen so far
        if cosine_score > mapping[word_a]['score']:
            mapping[word_a]['score'] = cosine_score
            mapping[word_a]['list_b'] = word_b

print(mapping)

While this works fine, I have two questions:

  1. The model I am using has an average performance of 63.30 here. Is there a better model, approach, or method that I can use?
  2. This is pretty slow, as I am comparing each of the 5 words in list_a to all of the 1000+ words in list_b. Is there a faster approach or model?

I am on Python v3.9.16.

Upvotes: 1

Views: 933

Answers (1)

Vegan Chili

Reputation: 63

I don't know whether there is a better model, besides trying other BERT-based models like bert-base-uncased, but there is a way to avoid the double for-loop.

Check out the documentation for util.cos_sim. It says you can pass it two tensors, and it will return a matrix of the cosine similarities between them.
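
For instance, with two hypothetical embedding tensors (the names x and y are just for illustration):

import torch
from sentence_transformers import util

x = torch.randn(5, 768)     # 5 embeddings of dimension 768
y = torch.randn(1000, 768)  # 1000 embeddings of dimension 768
sims = util.cos_sim(x, y)
print(sims.shape)  # torch.Size([5, 1000]): one similarity per (x, y) pair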

So, if list_b is not extraordinarily large, instead of using a for-loop you can store all the embeddings for each list in a single torch tensor:

import torch

# one row per word; all-mpnet-base-v2 produces 768-dimensional embeddings
embs_a = torch.zeros(len(list_a), 768)
embs_b = torch.zeros(len(list_b), 768)

You can then loop over the lists individually and assign each embedding to its respective tensor like so:

for i, word_a in enumerate(list_a):
    embedding_word_a = model.encode(word_a, convert_to_tensor=True)
    embs_a[i] = embedding_word_a

for i, word_b in enumerate(list_b):
    embedding_word_b = model.encode(word_b, convert_to_tensor=True)
    embs_b[i] = embedding_word_b
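
As a side note, if I remember the API correctly, model.encode also accepts a list of sentences, so the two loops above can likely be replaced by one batched call per list:

# batched alternative (assuming encode accepts a list of sentences)
embs_a = model.encode(list_a, convert_to_tensor=True)  # shape (len(list_a), 768)
embs_b = model.encode(list_b, convert_to_tensor=True)  # shape (len(list_b), 768)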

Finally, you can get the matrix of cosine similarities like so (I removed the rounding):

cos_sims = util.cos_sim(embs_a, embs_b)

Here, the rows are words from list_a, and the columns are words from list_b. So, cos_sims[0, 1] will be the cosine similarity between the first word from list_a and the second word from list_b.

Then, for each word in list_a, you can get the best matching word from list_b as follows:

scores, indices = torch.max(util.cos_sim(embs_a, embs_b), dim=-1)
mapping = dict()
for i, idx in enumerate(indices):
    # .item() converts the 0-dim tensors to plain Python numbers
    mapping[list_a[i]] = {"list_b": list_b[idx.item()], "score": scores[i].item()}

As a final note: if you're working with tensors or NumPy arrays and you're using a function specifically designed to process them, chances are that function is implemented efficiently (vectorized), so that you can avoid Python for-loops entirely.
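
Putting it all together, a minimal end-to-end sketch could look like this (the example data is just the one from your question; list_a and list_b would be your real lists):

from sentence_transformers import SentenceTransformer, util
import torch

model = SentenceTransformer('all-mpnet-base-v2')

list_a = ['light classic']                     # example data
list_b = ['classical music', 'rock', 'opera']  # example data

# encode each list in one call, then compute the full similarity matrix at once
embs_a = model.encode(list_a, convert_to_tensor=True)
embs_b = model.encode(list_b, convert_to_tensor=True)
scores, indices = torch.max(util.cos_sim(embs_a, embs_b), dim=-1)

mapping = {
    word_a: {"list_b": list_b[idx.item()], "score": scores[i].item()}
    for i, (word_a, idx) in enumerate(zip(list_a, indices))
}
print(mapping)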

Upvotes: 1
