Reputation: 399
I have a Python script that regularly scrapes comments from a list of web pages and inserts them into a database, but it inserts a comment only if it's not in the database yet. How feasible is it to store a hash of each comment along with its body so I can look it up faster the next time I need to check whether it's already been inserted, instead of storing only the bodies and comparing them word by word? If it's faster, what kind of hash should I use? MD5 or ...?
The average comment is about 1000 words. I'm aware that even a single-character difference results in a different hash; that's OK.
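A minimal sketch of the exact-hash approach described above, using SQLite and SHA-256 (the table and column names are illustrative, not from the original script):

```python
import hashlib
import sqlite3

# Store a SHA-256 hex digest of each comment body in an indexed column;
# PRIMARY KEY on the digest makes the duplicate check an index lookup.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE comments (
        body_hash TEXT PRIMARY KEY,  -- hex digest of the body
        body      TEXT NOT NULL
    )
""")

def insert_if_new(body: str) -> bool:
    """Insert the comment unless an identical body is already stored.

    Returns True if a new row was inserted, False if it was a duplicate.
    """
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    cur = conn.execute(
        "INSERT OR IGNORE INTO comments (body_hash, body) VALUES (?, ?)",
        (digest, body),
    )
    return cur.rowcount == 1

print(insert_if_new("same comment"))  # True: first time seen
print(insert_if_new("same comment"))  # False: already present
```

Comparing fixed-length digests (and letting the database index them) is much cheaper than comparing 1000-word bodies; SHA-256 is preferable to MD5 simply because it has no known collision weaknesses.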
Upvotes: 1
Views: 148
Reputation: 11933
You can use something like the Jaccard index. It will even let you search for partial matches: you can set a threshold to reject or accept matches (i.e. similar text).
You can also look into MinHashing, which is a space-efficient way of estimating Jaccard similarity; you get the benefit that texts differing by only a few characters still land in the same bucket (check out Locality-Sensitive Hashing). You will have to set a threshold, though, so precision/recall is the trade-off you will have to tackle.
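A rough pure-Python sketch of the idea above: word shingles, exact Jaccard similarity, and a MinHash signature that approximates it (all function names and parameters here are illustrative, not a library API):

```python
import hashlib

def shingles(text, k=3):
    # Break the text into overlapping k-word shingles.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    # Exact Jaccard similarity between two shingle sets.
    return len(a & b) / len(a | b) if a | b else 1.0

def minhash_signature(shingle_set, num_hashes=64):
    # Simulate num_hashes independent hash functions by salting SHA-1
    # with a seed; each signature slot keeps the minimum hash value.
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big"
            )
            for s in shingle_set
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    # The fraction of matching slots estimates the Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Instead of comparing full 1000-word bodies, you would store each comment's short signature and compare signatures; with LSH you would additionally band the signatures into buckets so only candidates sharing a bucket are compared at all.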
Upvotes: 3