I have 2 lists of words. The first list contains 5 words. The second list contains 1000s of words.
I am looking for an ML model that will help me find the best match between the words in the first list and the words in the second list, by assigning a score to every possible pair from the first list to the second list. The highest score means the best match, and a score of 1.0 means a perfect match.
For example, list A has the word Light Classic. List B has Classical Music, Rock and Opera.

Light Classic and Classical Music is 0.82
Light Classic and Rock is 0.23
Light Classic and Opera is 0.54

Therefore, the best match for Light Classic is Classical Music.
Currently I am using sentence_transformers with the all-mpnet-base-v2 model to find these scores, using the cosine similarity as the score. The code I am using is shown below:
from tqdm import tqdm
from sentence_transformers import SentenceTransformer, util

model_name = 'all-mpnet-base-v2'
model = SentenceTransformer(model_name)

mapping = {}
for word_a in list_a:
    word_a = word_a.lower()
    # best match found so far for this word
    mapping[word_a] = {
        'score': 0,
        'list_b': ''
    }
    embedding_word_a = model.encode(word_a, convert_to_tensor=True)
    for word_b in list_b:
        word_b = word_b.lower()
        if word_a == word_b:
            # exact string match: perfect score, stop searching
            mapping[word_a]['score'] = 1.0
            mapping[word_a]['list_b'] = word_b
            break
        embedding_word_b = model.encode(word_b, convert_to_tensor=True)
        cosine_score = round(util.cos_sim(embedding_word_a, embedding_word_b).item(), 2)
        if cosine_score > mapping[word_a]['score']:
            mapping[word_a]['score'] = cosine_score
            mapping[word_a]['list_b'] = word_b

print(mapping)
While this works fine, I have two questions:

1. Is there a better model I can use to get these scores?
2. Is there a way to avoid the double for-loop, since list B can contain thousands of words?
I am on Python v3.9.16.
I don't know whether you can use a better model, other than trying other BERT-style models such as bert-base-uncased, but there is a way to avoid the double for-loop.
Check out the documentation for util.cos_sim. It says you can give it two tensors, and it will return a matrix of the cosine similarities between them.
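For instance, a minimal sketch of that behaviour (the shapes below just mirror the sizes in the question):

import torch
from sentence_transformers import util

# two batches of embeddings: 5 rows for list_a, 1000 rows for list_b
a = torch.randn(5, 768)
b = torch.randn(1000, 768)

sims = util.cos_sim(a, b)
print(sims.shape)  # torch.Size([5, 1000]) -- one similarity per (a, b) pair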
So, if list_b is not extraordinarily large, instead of using a nested for-loop you can save the embeddings of each list in its own torch tensor:
import torch

# all-mpnet-base-v2 produces 768-dimensional embeddings
embs_a = torch.zeros(len(list_a), 768)  # one row per word in list_a (5 words)
embs_b = torch.zeros(len(list_b), 768)  # one row per word in list_b
You can then loop over each list individually and assign the embeddings to their respective tensors like so:
for i, word_a in enumerate(list_a):
    embedding_word_a = model.encode(word_a, convert_to_tensor=True)
    embs_a[i] = embedding_word_a

for i, word_b in enumerate(list_b):
    embedding_word_b = model.encode(word_b, convert_to_tensor=True)
    embs_b[i] = embedding_word_b
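As a side note (not part of the original answer, but standard sentence_transformers behaviour): model.encode also accepts a list of strings, so both loops can be replaced by a single call per list:

# encode each list in one call; with convert_to_tensor=True, model.encode
# returns an (n, dim) tensor holding one embedding per input string
embs_a = model.encode(list_a, convert_to_tensor=True)
embs_b = model.encode(list_b, convert_to_tensor=True)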
Finally, you can get the matrix of cosine similarities like so (I removed the rounding):
cos_sims = util.cos_sim(embs_a, embs_b)
Here, the rows are words from list_a, and the columns are words from list_b. So, cos_sims[0, 1] will be the cosine similarity between the first word from list_a and the second word from list_b.
Then, for each word in list_a, you can get the best matching word from list_b as follows:
# for each row (word in list_a), take the maximum similarity over list_b
scores, indices = torch.max(cos_sims, dim=-1)

mapping = dict()
for i, idx in enumerate(indices):
    mapping[list_a[i]] = {"list_b": list_b[idx.item()], "score": scores[i].item()}
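For very large lists, sentence_transformers also ships util.semantic_search, which chunks the similarity computation and directly returns the top-k matches per query. A rough sketch (not part of the original answer) of building the same mapping with it:

# semantic_search returns, for each query embedding, a list of the top_k
# corpus hits as dicts with 'corpus_id' and 'score' keys
hits = util.semantic_search(embs_a, embs_b, top_k=1)

mapping = {}
for i, hit in enumerate(hits):
    best = hit[0]  # the single best match for list_a[i]
    mapping[list_a[i]] = {"list_b": list_b[best["corpus_id"]], "score": best["score"]}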
As a final note: if you're working with tensors/numpy arrays and you're using a function specifically designed to process them, chances are this function is implemented efficiently in such a way that you can avoid for-loops.