Jimmy

Reputation: 12497

Comparing two blocks of text in Python

I have a system where information can come from various sources. I want to make sure I don't add identical (or extremely similar) pieces of information more than once. Here is an example:

Text A: One day a man walked over the hill and saw the sun

Text B: One day a man walked over a hill and saw the sun

Text C: One week a woman looked over a hill and saw the sun

In this case I want to get some sort of numerical value for the difference between the blocks of information. From there I can apply the following logic:

  1. When adding a text to the database, compare it with the existing values
  2. If it is very similar to an existing value, do not add it
  3. If it is different enough from every existing value, add it

Therefore we end up with different information in the database, and not duplicates, but we allow a small amount of leeway.

Can anyone tell me how I might attempt this in Python?

Upvotes: 0

Views: 2137

Answers (4)

Noel Evans

Reputation: 8536

This is a primitive way of doing it, but you could iterate through the words of one string, compare each with the word at the same position in the other string, and get a ratio of matches to total comparisons:

>>> aa = 'One day a man walked over the hill and saw the sun'
>>> bb = 'One day a man walked over a hill and saw the sun'
>>> matches = [a == b for a, b in zip(aa.split(' '), bb.split(' '))]
>>> matches
[True, True, True, True, True, True, False, True, True, True, True, True]
>>> sum(matches)
11
>>> len(matches)
12

So in this example, you can see that 11 of the 12 words matched. You can then set a pass/fail level.
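The same idea can be wrapped in a function with a pass/fail level. This is only a sketch: the name word_match_ratio and the 0.9 threshold are assumptions for illustration, and zip_longest is used instead of zip so that extra words in the longer text count as mismatches rather than being ignored.

```python
from itertools import zip_longest

def word_match_ratio(text_a, text_b):
    """Fraction of word positions at which both texts have the same word."""
    words_a = text_a.split()
    words_b = text_b.split()
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    # zip_longest pads the shorter text with None, so surplus words never match
    pairs = list(zip_longest(words_a, words_b))
    matches = sum(a == b for a, b in pairs)
    return matches / len(pairs)

aa = 'One day a man walked over the hill and saw the sun'
bb = 'One day a man walked over a hill and saw the sun'

THRESHOLD = 0.9  # hypothetical pass/fail level, tune for your data
ratio = word_match_ratio(aa, bb)
print(ratio >= THRESHOLD)  # 11 of 12 positions match, so this prints True
```

Because the comparison is positional, a single inserted or dropped word shifts everything after it and makes the texts look very different.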

Upvotes: 1

LarsVegas

Reputation: 6832

There are a couple of Python libraries that can help you with that. Have a look at this Q:.

The Levenshtein distance is a common algorithm. I found the NYSIIS algorithm very useful, especially if you want to save a string representation in a DB.
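For reference, the Levenshtein distance counts the single-character insertions, deletions and substitutions needed to turn one string into the other. The function below is a plain-Python sketch of the textbook dynamic-programming version, not the API of any particular library; the name levenshtein is chosen here for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the stored rows short
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # prints 3
```

A smaller distance means more similar texts. Dividing by the length of the longer string gives a value between 0 and 1 that is easier to compare against a threshold.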

This link will give you an excellent overview:

Upvotes: 1

Abhijit

Reputation: 63757

Looking at your problem, difflib.SequenceMatcher.ratio() might come in handy.

This nifty routine takes two strings and calculates a similarity index in the range [0, 1].

Quick Demo

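>>> import difflib, itertools
>>> st = ['One day a man walked over the hill and saw the sun',
          'One week a woman looked over a hill and saw the sun',
          'One day a man walked over a hill and saw the sun']
>>> for a, b in itertools.product(st, st):
    print("Text 1 {}".format(a))
    print("Text 2 {}".format(b))
    print("Similarity Index {}".format(difflib.SequenceMatcher(None, a, b).ratio()))
    print('-' * 80)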


Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One day a man walked over the hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.831683168317
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
Text 1 One week a woman looked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over the hill and saw the sun
Similarity Index 0.959183673469
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One week a woman looked over a hill and saw the sun
Similarity Index 0.868686868687
--------------------------------------------------------------------------------
Text 1 One day a man walked over a hill and saw the sun
Text 2 One day a man walked over a hill and saw the sun
Similarity Index 1.0
--------------------------------------------------------------------------------
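The ratio can then drive the add-or-skip decision from the question. This is a sketch under assumptions: the 0.9 threshold and the names is_near_duplicate and add_if_new are made up for illustration, and a plain list stands in for the database.

```python
import difflib

def is_near_duplicate(candidate, existing, threshold=0.9):
    """True if candidate is at least `threshold` similar to any stored text."""
    return any(
        difflib.SequenceMatcher(None, candidate, text).ratio() >= threshold
        for text in existing
    )

def add_if_new(candidate, existing, threshold=0.9):
    """Append candidate to existing unless a near-duplicate is already stored."""
    if is_near_duplicate(candidate, existing, threshold):
        return False
    existing.append(candidate)
    return True

text_a = 'One day a man walked over the hill and saw the sun'
text_b = 'One day a man walked over a hill and saw the sun'
text_c = 'One week a woman looked over a hill and saw the sun'

stored = []  # stand-in for the database table
for text in (text_a, text_b, text_c):
    add_if_new(text, stored)

# Text B scores about 0.96 against Text A and is skipped;
# Text C scores about 0.83 and is kept.
print(len(stored))  # prints 2
```

Each new text is compared with every stored text, so the cost grows with the size of the table. For a large table some pre-filtering would be needed.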

Upvotes: 2

duck

Reputation: 2561

In Python, or any other language, hashes are the easiest way to remove exact duplicates.

You can maintain a table of the hashes already added. When you add another text, just check whether its hash is already present.

Use hashlib for this.

Here is a hashlib usage example:

import hashlib

# update() expects bytes in Python 3, hence the b"..." literals
m1 = hashlib.md5()
m1.update(b" the spammish repetition")
print(m1.hexdigest())

m2 = hashlib.md5()
m2.update(b" the spammish")
print(m2.hexdigest())

# same input as m1, so the digest is identical
m3 = hashlib.md5()
m3.update(b" the spammish repetition")
print(m3.hexdigest())

Output:

d21fe4d39740662f11ad2cf8035b471b
03498704df59a124ee6ac0681e64841b
d21fe4d39740662f11ad2cf8035b471b
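The table of already added hashes mentioned above could look like the sketch below. The names fingerprint, seen_hashes and add_if_unseen are assumptions for illustration, and an in-memory set stands in for a database table.

```python
import hashlib

def fingerprint(text):
    """Hex MD5 digest of the UTF-8 encoded text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

seen_hashes = set()  # stand-in for a table of stored digests

def add_if_unseen(text):
    """Record the text's digest; return False if it was already recorded."""
    digest = fingerprint(text)
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True

print(add_if_unseen('One day a man walked over the hill and saw the sun'))  # True
print(add_if_unseen('One day a man walked over the hill and saw the sun'))  # False
print(add_if_unseen('One day a man walked over a hill and saw the sun'))    # True
```

The last call returns True because changing a single word produces a different digest. A hash lookup therefore catches exact repeats only, not the "extremely similar" texts from the question.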

Upvotes: 0
