stkusr1234

Reputation: 61

How can I make text search and similarity computation across millions of records efficient in Python?

I have two tables containing 2 million records each. One holds the item names and the other holds item descriptions along with other attributes. I have to match each item in table 1 against every description in table 2 to find the most similar one. So for each of the 2 million items I have to scan the whole other table, which makes 2 million * 2 million (4 * 10^12) comparisons! How can I do that efficiently in Python? As it stands now, it would take years to compute.

Right now I split each item name into a list of words and use a regex search to check whether each word is contained in the description. If it is, I increase the match count by 1, and I calculate the similarity from that count.
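The approach described above could be sketched roughly as follows. This is only an illustration: the function name `word_match_similarity` is hypothetical, and both the whole-word matching and the division by the number of words in the item name are assumptions, since the exact scoring is not spelled out.

```python
import re

def word_match_similarity(item_name, description):
    """Fraction of the words in item_name that occur in description.

    Assumption: words are matched case-insensitively as whole words,
    and the match count is divided by the number of words in the name.
    """
    words = re.findall(r"\w+", item_name.lower())
    if not words:
        return 0.0
    desc = description.lower()
    # One regex search per word of the item name
    matches = sum(
        1 for word in words
        if re.search(r"\b" + re.escape(word) + r"\b", desc)
    )
    return matches / len(words)

print(word_match_similarity("red cotton shirt", "A shirt made of soft cotton"))
```

Calling this once per (item, description) pair is what leads to the 2 million * 2 million comparisons.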

So my questions are:

  1. How can I make the computation faster? Should I use multithreading, split the data, or something similar?

  2. Is there another similarity algorithm that would work here? Please note that I have descriptions on the other side, so cosine similarity etc. don't work because of the differing number of words.

Upvotes: 2

Views: 652

Answers (2)

arshpreet

Reputation: 729

You can use NLTK as well.

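from nltk.metrics import accuracy, edit_distance

# Fraction of positions at which the two tag sequences agree
reference = 'DET NN VB DET JJ NN NN IN DET NN'.split()
test = 'DET VB VB DET NN NN NN IN DET NN'.split()
print(accuracy(reference, test))

# Levenshtein edit distance between two strings
print(edit_distance("rain", "shine"))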

Upvotes: 0

salomonderossi

Reputation: 2188

You could try the Distance package to calculate the Levenshtein distance as a similarity measure.

From the documentation:

Comparing lists of strings can also be useful for computing similarities between sentences, paragraphs, etc., in articles or books, as for plagiarism recognition:

>>> import distance
>>> sent1 = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
>>> sent2 = ['the', 'lazy', 'fox', 'jumps', 'over', 'the', 'crazy', 'dog']
>>> distance.levenshtein(sent1, sent2)
3

Or the python-Levenshtein package:

>>> from Levenshtein import distance
>>> distance('Levenshtein', 'Lenvinsten')
4

>>> distance('Levenshtein', 'Levensthein')
2
>>> distance('Levenshtein', 'Levenshten')
1
>>> distance('Levenshtein', 'Levenshtein')
0
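Whichever distance function you pick, running it on all 2 million * 2 million pairs will still be too slow, so it may help to narrow down the candidates first. Here is a minimal sketch of one way to do that, assuming an inverted index from words to description ids, so that each item name is only compared with descriptions sharing at least one word with it. The names `tokenize`, `build_index` and `best_match` are hypothetical helpers, and the word-overlap score is an assumption, not part of either package above.

```python
from collections import Counter, defaultdict
import re

def tokenize(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"\w+", text.lower()))

def build_index(descriptions):
    """Map each word to the ids of the descriptions that contain it."""
    index = defaultdict(list)
    for desc_id, text in enumerate(descriptions):
        for word in tokenize(text):
            index[word].append(desc_id)
    return index

def best_match(item_name, index):
    """Return (description id, fraction of name words it shares)."""
    words = tokenize(item_name)
    if not words:
        return None, 0.0
    shared = Counter()
    # Only descriptions that contain at least one name word are touched
    for word in words:
        for desc_id in index.get(word, ()):
            shared[desc_id] += 1
    if not shared:
        return None, 0.0
    desc_id, count = shared.most_common(1)[0]
    return desc_id, count / len(words)

descriptions = [
    "A shirt made of soft cotton",
    "Stainless steel kitchen knife",
    "Blue denim jeans",
]
index = build_index(descriptions)
print(best_match("red cotton shirt", index))
```

The index is built once over the description table. Very common words produce long lists of ids, so dropping them before indexing should keep the candidate sets small. The few candidates that remain per item could then be re-scored with a Levenshtein distance if needed.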

Upvotes: 0
