Reputation: 83137

Find the most likely word alignment between two strings in Python

I have 2 similar strings. How can I find the most likely word alignment between these two strings in Python?

Example of input:

string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'

Desired output:

alignment['my']        = 'my'
alignment['channel']   = 'channel'
alignment['is']        = 'is'
alignment['youtube']   = 'youtube.com/example'
alignment['dot']       = 'youtube.com/example'
alignment['com']       = 'youtube.com/example'
alignment['slash']     = 'youtube.com/example'
alignment['example']   = 'youtube.com/example'
alignment['and']       = 'and'
alignment['then']      = 'then'
alignment['I']         = 'I'
alignment['also']      = 'also'
alignment['do']        = 'do'
alignment['live']      = 'livestreaming'
alignment['streaming'] = 'livestreaming'
alignment['on']        = 'on'
alignment['twitch']    = 'twitch'

Upvotes: 2

Answers (4)

Josiah Yoder

Reputation: 3756

For practical purposes, I recommend Python's difflib library.

It is NOT probabilistic, but it used good heuristics the first time I used it. My application appears similar to yours -- aligning transcripts with text. I was using it to visualize difference and the get_opcodes method gave me a good way to build various visualizations of the edits necessary to get from one to the other.

The documentation helps one get started:

a = "qabxcd"
b = "abycdf"
s = SequenceMatcher(None, a, b)
for tag, i1, i2, j1, j2 in s.get_opcodes():
    print('{:7}   a[{}:{}] --> b[{}:{}] {!r:>8} --> {!r}'.format(
        tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]))> ```

Output:

delete    a[0:1] --> b[0:0]      'q' --> ''
equal     a[1:3] --> b[0:2]     'ab' --> 'ab'
replace   a[3:4] --> b[2:3]      'x' --> 'y'
equal     a[4:6] --> b[3:5]     'cd' --> 'cd'
insert    a[6:6] --> b[5:6]       '' --> 'f'

When you have text consisting of words, I'd recommend

import difflib
string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'.split()
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'.split()
s = difflib.SequenceMatcher(None, string1, string2)
for tag, i1, i2, j1, j2 in s.get_opcodes():
    print('{:7}   a[{}:{}] --> b[{}:{}] {!r:>8} --> {!r}'.format(
        tag, i1, i2, j1, j2, string1[i1:i2], string2[j1:j2]))

The .split() on each string converts it into a list of words which then become your discrete tokens to be compared. Example output:

equal     a[0:3] --> b[0:3] ['my', 'channel', 'is'] --> ['my', 'channel', 'is']
replace   a[3:8] --> b[3:4] ['youtube', 'dot', 'com', 'slash', 'example'] --> ['youtube.com/example']
equal     a[8:13] --> b[4:9] ['and', 'then', 'I', 'also', 'do'] --> ['and', 'then', 'I', 'also', 'do']
replace   a[13:15] --> b[9:10] ['live', 'streaming'] --> ['livestreaming']
equal     a[15:17] --> b[10:12] ['on', 'twitch.'] --> ['on', 'twitch.']

Upvotes: 1

Toothpick Anemone

Reputation: 4644

The Needleman-Wunch Algorithm

Biologists sometimes try to align the DNA of two different plants or animals to see how much of their genome they share in common.

MOUSE: A  A  T  C  C  G  C  T  A  G  
RAT:   A  A  A  C  C  C  T  T  A  G  
       +  +  -  +  +  -  -  +  +  +

Above "+" means that pieces of DNA match.
Above "-" means that pieces of DNA mis-match.

You can use the full ASCII character set (128 characters) instead of the letters ATCG that biologists use.

I recommend using the the Needleman Wunsch Algorithm

Needle-Wunsch is not the fastest algorithm in the world.
However, Needle-Wunsch is easy to understand.

In cases were one string of English text is completely missing a word present in the other text, Needleman Wunsch will match the word to special "GAP" character.

+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
|  The  | reason | that  | I | went | to | the | store | was | to | buy |  some | food |
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
| <GAP> | reason | <GAP> | I | went |  2 | te  | store | wuz |  2 | buy | <GAP> | fud  |
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+

The special GAP characters are fine.

However, what is in-efficient about Needle Wunsch is that people who wrote the algorithm believed that the order of the gap characters was important. The following are computed as two separate cases:

ALIGNMENT ONE

+---+-------+-------+---+---+
| A |   1   | <GAP> | R | A |
+---+-------+-------+---+---+
| A | <GAP> | B     | R | A |
+---+-------+-------+---+---+

ALIGNMENT TWO

+---+-------+-------+---+---+
| A | <GAP> | 1     | R | A |
+---+-------+-------+---+---+
| A | B     | <GAP> | R | A |
+---+-------+-------+---+---+

However, if you have two or more gaps in a row, then order of the gaps should not matter.

The Needleman-Wunch algorithm calculates the same thing many times over because whoever wrote the algorithm thought that order mattered a little more than it really does.

The following two alignments have the same score.

Also, both alignments have more or less the same meaning in the "real world" (outside of the computer).

However, the Needleman-Wunch algorithm will compute the scores of the two example alignments twice instead of computing it only one time.

Upvotes: 1

Erwan

Reputation: 1135

Previous answers offer biology-based alignment methods, there are NLP-based alignments methods as well. The most standard would be the Levenshtein edit distance. There are a few variants, and generally this problem is considered closely related to the question of text similarity measures (aka fuzzy matching, etc.). In particular it's possible to mix alignment at the level of word and characters. as well as different measures (e.g. SoftTFIDF, see this answer).

Upvotes: 3

fsimonjetz

Reputation: 5802

Alignment is tricky. spaCy can do it (see Aligning tokenization) but AFAIK it assumes that the two underlying strings are identical which is not the case here.

I used Bio.pairwise2 a few years back for a similar problem. I don't quite remember the exact settings, but here's what the default setup would give you:

from Bio import pairwise2
from Bio.pairwise2 import format_alignment


string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'

alignments = pairwise2.align.globalxx(string1.split(), 
                                      string2.split(),
                                      gap_char=['-']
                                     )

The resulting alignments - pretty close already:

>>> format_alignment(*alignments[0])
my channel is youtube dot com slash example          -          and then I also do live streaming       -       on twitch. 
 |    |     |                                                    |    |  |   |   |                               |    |    
my channel is    -     -   -    -      -    youtube.com/example and then I also do  -       -     livestreaming on twitch. 
  Score=10

You can provide your own matching functions, which would make fuzzywuzzy an interesting addition.