Reputation: 61
I have two tables containing 2 million records each. One has the item names and the other has item descriptions, along with other attributes. I have to match each item name in table 1 against each description in table 2 to find the best-matching (maximum-similarity) pairs. So basically, for each of the 2 million items, I have to scan the other table to find the best match. That makes 2 million * 2 million comparisons! How do I go about doing this in Python efficiently? As it stands now, it would take years to compute.
Right now the approach I am following is a regex search: I split each item name into a list of words and check whether each word is contained in the description. If it is, I increase the match count by 1, and from that count I calculate the similarity.
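For reference, a minimal sketch of that word-overlap scoring, assuming whitespace tokenization and case-insensitive substring matching (the function name word_overlap_similarity is just illustrative):

def word_overlap_similarity(item_name, description):
    # Split the item name into words and count how many appear in the description.
    words = item_name.lower().split()
    if not words:
        return 0.0
    desc = description.lower()
    matches = sum(1 for word in words if word in desc)
    # Similarity = fraction of the item's words found in the description.
    return matches / len(words)

print(word_overlap_similarity("red cotton shirt", "Shirt made of 100% red cotton"))  # 1.0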
So my questions are:
How can I make the computation faster? Should I use multithreading, split the data, or something else along those lines?
Is there another similarity algorithm that would work here? Note that I have a full description on the other side, so cosine similarity and the like don't work because the numbers of words differ.
Upvotes: 2
Views: 652
Reputation: 729
You can use NLTK as well:
from nltk.metrics import accuracy, edit_distance

# Accuracy: fraction of positions where the two token sequences agree.
reference = 'DET NN VB DET JJ NN NN IN DET NN'.split()
test = 'DET VB VB DET NN NN NN IN DET NN'.split()
print(accuracy(reference, test))

# Levenshtein edit distance between two strings.
print(edit_distance("rain", "shine"))
Upvotes: 0
Reputation: 2188
You could try the Distance package to calculate the Levenshtein Distance for similarity.
From the documentation:
Comparing lists of strings can also be useful for computing similarities between sentences, paragraphs, etc., in articles or books, as for plagiarism recognition:
>>> import distance
>>> sent1 = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
>>> sent2 = ['the', 'lazy', 'fox', 'jumps', 'over', 'the', 'crazy', 'dog']
>>> distance.levenshtein(sent1, sent2)
3
Or the python-Levenshtein package:
>>> from Levenshtein import distance
>>> distance('Levenshtein', 'Lenvinsten')
4
>>> distance('Levenshtein', 'Levensthein')
2
>>> distance('Levenshtein', 'Levenshten')
1
>>> distance('Levenshtein', 'Levenshtein')
0
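As a rough sketch of applying this to your item/description comparison: python-Levenshtein also exposes Levenshtein.ratio, which returns a normalized similarity in [0, 1] instead of a raw edit distance (the example strings below are just illustrative):

from Levenshtein import distance, ratio

item = "red cotton shirt"
description = "Shirt made of 100% red cotton"

# Raw edit distance between the two strings (lower means more similar).
print(distance(item.lower(), description.lower()))

# Normalized similarity in [0, 1] (higher means more similar).
print(ratio(item.lower(), description.lower()))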
Upvotes: 0