Reputation: 12497
I have a system where information can come from various sources. I want to make sure I don't add identical (or extremely similar) pieces of information. Here is an example:
Text A: One day a man walked over the hill and saw the sun
Text B: One day a man walked over a hill and saw the sun
Text C: One week a woman looked over a hill and saw the sun
In this case I want to get some sort of numerical value for the difference between the blocks of information. From there I can apply the following logic:
Therefore we end up with different information in the database, and not duplicates, but we allow a small amount of leeway.
Can anyone tell me how I might attempt this in Python?
Upvotes: 0
Views: 2137
Reputation: 8536
A primitive way of doing this: split both strings into words, compare the words at the same position in each string, and compute the ratio of matches to the total number of words compared:
>>> aa = 'One day a man walked over the hill and saw the sun'
>>> bb = 'One day a man walked over a hill and saw the sun'
>>> matches = [a == b for a, b in zip(aa.split(' '), bb.split(' '))]
>>> matches
[True, True, True, True, True, True, False, True, True, True, True, True]
>>> sum(matches)
11
>>> len(matches)
12
So in this example, 11 of the 12 words matched. You can then set a pass/fail threshold.
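The same idea can be wrapped in a small function. This is only a sketch: the name `word_match_ratio` and the 0.9 threshold are made up for illustration, and it pads the shorter text so that extra words count as mismatches (plain `zip` silently drops them).

```python
from itertools import zip_longest


def word_match_ratio(text_a, text_b):
    """Fraction of word positions at which the two texts agree."""
    words_a = text_a.split()
    words_b = text_b.split()
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    # zip_longest pads the shorter text with None, so surplus words
    # in the longer text are counted as mismatches.
    matches = sum(a == b for a, b in zip_longest(words_a, words_b))
    return matches / max(len(words_a), len(words_b))


aa = 'One day a man walked over the hill and saw the sun'
bb = 'One day a man walked over a hill and saw the sun'

THRESHOLD = 0.9  # hypothetical cut-off, tune it for your data
ratio = word_match_ratio(aa, bb)
is_duplicate = ratio >= THRESHOLD
print(ratio, is_duplicate)
```

Keep in mind that the comparison is positional: if one text has a single word inserted near the start, every later word shifts by one and the ratio collapses, even though the texts are nearly identical.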
Upvotes: 1
Reputation: 6832
There are a couple of Python libraries that can help you with that. Have a look at this Q:.
The Levenshtein distance is a common algorithm. I found the NYSIIS algorithm very useful, especially if you want to save a string representation in a DB.
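To give an idea of what the Levenshtein distance measures, here is a minimal pure-Python sketch (not taken from any particular library). The function names `levenshtein` and `similarity` are my own, and normalising by the longer length is just one possible way to turn the distance into a 0 to 1 score.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn s into t."""
    if len(s) < len(t):
        s, t = t, s  # iterate over the longer string, keep rows short
    previous = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs
                current[j - 1] + 1,      # insert ct
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(s, t):
    """Scale the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1 - levenshtein(s, t) / longest


a = 'One day a man walked over the hill and saw the sun'
b = 'One day a man walked over a hill and saw the sun'
print(levenshtein(a, b), similarity(a, b))
```

This runs in time proportional to the product of the two lengths, so for large volumes of text a compiled library implementation is probably the better choice.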
This link will give you an excellent overview:
Upvotes: 1
Reputation: 63757
Looking at your problem, difflib.SequenceMatcher.ratio() might come in handy.
This nifty routine takes two strings and calculates a similarity index in the range [0, 1]:
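>>> import difflib, itertools
>>> st = ['One day a man walked over the hill and saw the sun',
...       'One week a woman looked over a hill and saw the sun',
...       'One day a man walked over a hill and saw the sun']
>>> for a, b in itertools.product(st, st):
...     print("Text 1 {}".format(a))
...     print("Text 2 {}".format(b))
...     # rounded to 12 places to keep the printed index short
...     print("Similarity Index {}".format(round(difflib.SequenceMatcher(None, a, b).ratio(), 12)))
...     print('-' * 80)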
Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
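To turn the similarity index into the "skip near-duplicates" logic from the question, something along these lines might work. It is a sketch only: `is_near_duplicate`, `add_if_new` and the 0.9 threshold are illustrative names and values, and a plain list stands in for the database.

```python
import difflib


def is_near_duplicate(candidate, existing, threshold=0.9):
    """True if candidate is at least `threshold` similar to any stored text."""
    return any(
        difflib.SequenceMatcher(None, candidate, text).ratio() >= threshold
        for text in existing
    )


def add_if_new(candidate, stored, threshold=0.9):
    """Append candidate to stored unless a near-duplicate is already there."""
    if is_near_duplicate(candidate, stored, threshold):
        return False
    stored.append(candidate)
    return True


texts = [
    'One day a man walked over the hill and saw the sun',   # Text A
    'One day a man walked over a hill and saw the sun',     # Text B
    'One week a woman looked over a hill and saw the sun',  # Text C
]

stored = []  # stand-in for the database
for text in texts:
    add_if_new(text, stored)

print(stored)
```

With the indices shown above, Text B (about 0.96 against Text A) is rejected, while Text C (about 0.83) is kept. Each new text is compared against every stored one, so this gets slow as the collection grows.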
Upvotes: 2
Reputation: 2561
In Python or any other language, hashes are the easiest way to remove exact duplicates.
You can maintain a table of the hashes already added. When you add another item, just check whether its hash is already present.
Use hashlib for this.
Here is a hashlib usage example:
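import hashlib
# In Python 3, update() takes bytes, hence the b"..." literals
m1 = hashlib.md5()
m1.update(b" the spammish repetition")
print(m1.hexdigest())
# A different input gives a different digest
m2 = hashlib.md5()
m2.update(b" the spammish")
print(m2.hexdigest())
# The same input as m1 gives the same digest as m1
m3 = hashlib.md5()
m3.update(b" the spammish repetition")
print(m3.hexdigest())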
Output:
d21fe4d39740662f11ad2cf8035b471b
03498704df59a124ee6ac0681e64841b
d21fe4d39740662f11ad2cf8035b471b
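The "table of already added hashes" could look roughly like this. It is a sketch under a few assumptions of mine: `fingerprint` and `add_if_unseen` are made-up names, a Python set stands in for the table, SHA-256 is swapped in for MD5 (any hashlib digest works the same way here), and the text is lower-cased and whitespace-collapsed before hashing so that trivial formatting differences do not produce different hashes.

```python
import hashlib

seen_hashes = set()  # stand-in for the table of already added hashes


def fingerprint(text):
    """Hash of the text after lower-casing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def add_if_unseen(text):
    """Record the text's hash; return False if it was already recorded."""
    digest = fingerprint(text)
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True


first = add_if_unseen("the spammish repetition")
second = add_if_unseen("The  spammish   repetition ")  # same after normalising
third = add_if_unseen("the spammish")                  # genuinely different
print(first, second, third)
```

A hash only tells you whether two inputs are identical after normalisation. Texts that differ by a single word, such as "the hill" versus "a hill", get unrelated hashes, so this covers exact duplicates but not the "extremely similar" case from the question.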
Upvotes: 0