user2626682
user2626682

Reputation: 1499

Python - difference between two strings

I'd like to store a lot of words in a list. Many of these words are very similar. For example I have word afrykanerskojęzyczny and many of words like afrykanerskojęzycznym, afrykanerskojęzyczni, nieafrykanerskojęzyczni. What is the effective (fast and giving small diff size) solution to find difference between two strings and restore second string from the first one and diff?

Upvotes: 149

Views: 333934

Answers (7)

dawg
dawg

Reputation: 103714

You can use ndiff in the difflib module to do this. It has all the information necessary to convert one string into another string.

A simple example:

import difflib

cases=[('afrykanerskojęzyczny', 'afrykanerskojęzycznym'),
       ('afrykanerskojęzyczni', 'nieafrykanerskojęzyczni'),
       ('afrykanerskojęzycznym', 'afrykanerskojęzyczny'),
       ('nieafrykanerskojęzyczni', 'afrykanerskojęzyczni'),
       ('nieafrynerskojęzyczni', 'afrykanerskojzyczni'),
       ('abcdefg','xac')] 

for a,b in cases:     
    print('{} => {}'.format(a,b))  
    for i,s in enumerate(difflib.ndiff(a, b)):
        if s[0]==' ': continue
        elif s[0]=='-':
            print(u'Delete "{}" from position {}'.format(s[-1],i))
        elif s[0]=='+':
            print(u'Add "{}" to position {}'.format(s[-1],i))    
    print()      

prints:

afrykanerskojęzyczny => afrykanerskojęzycznym
Add "m" to position 20

afrykanerskojęzyczni => nieafrykanerskojęzyczni
Add "n" to position 0
Add "i" to position 1
Add "e" to position 2

afrykanerskojęzycznym => afrykanerskojęzyczny
Delete "m" from position 20

nieafrykanerskojęzyczni => afrykanerskojęzyczni
Delete "n" from position 0
Delete "i" from position 1
Delete "e" from position 2

nieafrynerskojęzyczni => afrykanerskojzyczni
Delete "n" from position 0
Delete "i" from position 1
Delete "e" from position 2
Add "k" to position 7
Add "a" to position 8
Delete "ę" from position 16

abcdefg => xac
Add "x" to position 0
Delete "b" from position 2
Delete "d" from position 4
Delete "e" from position 5
Delete "f" from position 6
Delete "g" from position 7

Upvotes: 178

greenkey
greenkey

Reputation: 421

I found this problem interesting and even if this is an old question, I wanted to try developing a solution without any external library.

import pytest

@pytest.mark.parametrize(
    "s1,s2,expected",
    [
        ("", "", ["", ""]),
        ("a", "a", ["", ""]),
        ("a", "b", ["a", "b"]),
        ("ac", "bc", ["a", "b"]),
        ("ca", "cb", ["a", "b"]),
        ("this is a message", "this is b message", ["a", "b"]),
        ("this is a huge message", "this is b message", ["a huge", "b"]),
    ],
)
def test_string_diff(s1, s2, expected):
    assert string_diff(s1, s2) == expected


def string_diff(s1: str, s2: str) -> str:
    d1, d2 = [], []
    i1 = i2 = 0
    l1 = len(s1)
    l2 = len(s2)
    while True:
        if i1 >= len(s1) or i2 >= len(s2):
            d1.extend(s1[i1:])
            d2.extend(s2[i2:])
            break
        if s1[i1] != s2[i2]:
            e1 = l1 - i1
            e2 = l2 - i2
            if e1 > e2:
                d1.append(s1[i1])
                i2 -= 1
            elif e1 < e2:
                d2.append(s2[i2])
                i1 -= 1
            else:
                d1.append(s1[i1])
                d2.append(s2[i2])
        i1 += 1
        i2 += 1
    return ["".join(d1), "".join(d2)]

Upvotes: 0

moo
moo

Reputation: 2175

You might find the tools available in the NLTK library useful for calculating the difference between different words.

nltk.metrics.distance.edit_distance() is a mature (non-standard) library implementation that calculates the Levenshtein distance

A simple example might be:

from nltk.metrics.distance import *

w1 = 'wordone'
w2 = 'wordtwo'
edit_distance(w1, w2)

Out: 3

Additional parameter allow the output to be weighted, depending on the costs of different actions (substitutions/insertions) and different character differences (e.g. less cost for characters closer on the keyboard).

Upvotes: 2

Elias Benevedes
Elias Benevedes

Reputation: 391

The answer to my comment above on the Original Question makes me think this is all he wants:

loopnum = 0
word = 'afrykanerskojęzyczny'
wordlist = ['afrykanerskojęzycznym','afrykanerskojęzyczni','nieafrykanerskojęzyczni']
for i in wordlist:
    wordlist[loopnum] = word
    loopnum += 1

This will do the following:

For every value in wordlist, set that value of the wordlist to the origional code.

All you have to do is put this piece of code where you need to change wordlist, making sure you store the words you need to change in wordlist, and that the original word is correct.

Upvotes: 0

Eric
Eric

Reputation: 749

I like the ndiff answer, but if you want to spit it all into a list of only the changes, you could do something like:

import difflib

case_a = 'afrykbnerskojęzyczny'
case_b = 'afrykanerskojęzycznym'

output_list = [li for li in difflib.ndiff(case_a, case_b) if li[0] != ' ']

Upvotes: 56

Craig Silverstein
Craig Silverstein

Reputation: 119

What you are asking for is a specialized form of compression. xdelta3 was designed for this particular kind of compression, and there's a python binding for it, but you could probably get away with using zlib directly. You'd want to use zlib.compressobj and zlib.decompressobj with the zdict parameter set to your "base word", e.g. afrykanerskojęzyczny.

Caveats are zdict is only supported in python 3.3 and higher, and it's easiest to code if you have the same "base word" for all your diffs, which may or may not be what you want.

Upvotes: 2

perreal
perreal

Reputation: 97918

You can look into the regex module (the fuzzy section). I don't know if you can get the actual differences, but at least you can specify allowed number of different types of changes like insert, delete, and substitutions:

import regex
sequence = 'afrykanerskojezyczny'
queries = [ 'afrykanerskojezycznym', 'afrykanerskojezyczni', 
            'nieafrykanerskojezyczni' ]
for q in queries:
    m = regex.search(r'(%s){e<=2}'%q, sequence)
    print 'match' if m else 'nomatch'

Upvotes: 5

Related Questions