bsr
bsr

Reputation: 58662

Algorithm to find the least difference between lists

I have been trying to understand the algorithm used here to compare two lists, implemented in this commit. The intention, as I understood, is to find the least amount of changes to create dst from src. These changes are later listed as sequence of patch commands. I am not a python developer, and learned generators to understand the flow and how recursion is done. but, now I can't make much sense out of the output generated by the _split_by_common_seq method. I fed a few different lists, and the output is shown below. Can you please help me to understand why the output is like it is in these cases.

in the reference case,

src [0, 1, 2, 3]
dst [1, 2, 4, 5]
[[(0, 1), None], [(3, 4), (2, 4)]]

I cannot see how it is related to the the picture in the doc. Why (3,4) and (2,4) on the right? Is it a standard algorithm?

test cases

src [1, 2, 3]
dst [1, 2, 3, 4, 5, 6, 7, 8]
[[None, None], [None, (3, 8)]]   

src [1, 2, 3, 4, 5]
dst [1, 2, 3, 4, 5, 6, 7, 8]
[[None, None], [None, (5, 8)]]

src [4, 5]
dst [1, 2, 3, 4, 5, 6, 7, 8]
[[None, (0, 3)], [None, (5, 8)]]

src [0, 1, 2, 3]
dst [1, 2, 4, 5]
[[(0, 1), None], [(3, 4), (2, 4)]]

src [0, 1, 2, 3]
dst [1, 2, 3, 4, 5]
[[(0, 1), None], [None, (3, 5)]]

src [0, 1, 3]
dst [1, 2, 4, 5]
[[(0, 1), None], [(2, 3), (1, 4)]] 

For future reference, here's the code (taken from the aforementioned repository):

import itertools

def _longest_common_subseq(src, dst):
    """Returns pair of ranges of longest common subsequence for the `src`
and `dst` lists.

>>> src = [1, 2, 3, 4]
>>> dst = [0, 1, 2, 3, 5]
>>> # The longest common subsequence for these lists is [1, 2, 3]
... # which is located at (0, 3) index range for src list and (1, 4) for
... # dst one. Tuple of these ranges we should get back.
... assert ((0, 3), (1, 4)) == _longest_common_subseq(src, dst)
"""
    lsrc, ldst = len(src), len(dst)
    drange = list(range(ldst))
    matrix = [[0] * ldst for _ in range(lsrc)]
    z = 0 # length of the longest subsequence
    range_src, range_dst = None, None
    for i, j in itertools.product(range(lsrc), drange):
        if src[i] == dst[j]:
            if i == 0 or j == 0:
                matrix[i][j] = 1
            else:
                matrix[i][j] = matrix[i-1][j-1] + 1
            if matrix[i][j] > z:
                z = matrix[i][j]
            if matrix[i][j] == z:
                range_src = (i-z+1, i+1)
                range_dst = (j-z+1, j+1)
        else:
            matrix[i][j] = 0
    return range_src, range_dst

def split_by_common_seq(src, dst, bx=(0, -1), by=(0, -1)):
    """Recursively splits the `dst` list onto two parts: left and right.
The left part contains differences on left from common subsequence,
same as the right part by for other side.

To easily understand the process let's take two lists: [0, 1, 2, 3] as
`src` and [1, 2, 4, 5] for `dst`. If we've tried to generate the binary tree
where nodes are common subsequence for both lists, leaves on the left
side are subsequence for `src` list and leaves on the right one for `dst`,
our tree would looks like::

[1, 2]
/ \
[0] []
/ \
[3] [4, 5]

This function generate the similar structure as flat tree, but without
nodes with common subsequences - since we're don't need them - only with
left and right leaves::

[]
/ \
[0] []
/ \
[3] [4, 5]

The `bx` is the absolute range for currently processed subsequence of
`src` list. The `by` means the same, but for the `dst` list.
"""
    # Prevent useless comparisons in future
    bx = bx if bx[0] != bx[1] else None
    by = by if by[0] != by[1] else None

    if not src:
        return [None, by]
    elif not dst:
        return [bx, None]

    # note that these ranges are relative for processed sublists
    x, y = _longest_common_subseq(src, dst)

    if x is None or y is None: # no more any common subsequence
        return [bx, by]

    return [split_by_common_seq(src[:x[0]], dst[:y[0]],
                                 (bx[0], bx[0] + x[0]),
                                 (by[0], by[0] + y[0])),
            split_by_common_seq(src[x[1]:], dst[y[1]:],
                                 (bx[0] + x[1], bx[0] + len(src)),
                                 (bx[0] + y[1], bx[0] + len(dst)))]

Upvotes: 3

Views: 398

Answers (1)

Filipe Gonçalves
Filipe Gonçalves

Reputation: 21213

It is a cute algorithm, but I don't think it's a "known" one. It's a clever way of comparing lists, and probably not the first time that someone thought of it, but I had never seen it before.

Basically, the output is telling you the ranges that look different in src and dst.

The function always returns a list with 2 lists. The first list refers to the elements in src and dst that are on the left side of the longest common subsequence between src and dst; the second refers to the elements that are on the right side of the longest common subsequence. Each of these lists holds a pair of tuples. Tuples represent a range in the list - (x, y) denotes the elements you would get if you performed lst[x:y]. From this pair of tuples, the first tuple is the range from src, the second tuple is the range from dst.

At each step, the algorithm computes the ranges of src and dst that are to the left of the longest common subsequence and to the right of the longest common subsequence between src and dst.

Let's look at your first example to clear things up:

src [0, 1, 2, 3]
dst [1, 2, 4, 5]

The longest common subsequence between src and dst is [1, 2]. In src, the range (0, 1) defines the elements that are immediately to the left of [1, 2]; in dst, that range is empty, because there is nothing before [1, 2]. So, the first list will be [(0, 1), None].

To the right of [1, 2], in src, we have the elements in the range (3, 4), and in dst we have 4 and 5, which are represented by the range (2, 4). So the second list will be [(3, 4), (2, 4)].

And there you go:

[[(0, 1), None], [(3, 4), (2, 4)]]

How does this relate to the tree in the comments?

The leafs in the tree are using a different notation: instead of a tuple describing a range, the actual elements on that range are shown. In fact, [0] is the only element in the range (0, 1) in src. The same applies for the rest.

Once you get this, the other examples you posted should be pretty easy to follow. But note that the output can become more complex if there is more than one common subsequence: the algorithm finds every common subsequences in nonincreasing order; since each invocation returns a list with 2 elements, this means that you will get nested lists in cases like these. Consider:

src = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
dst = [46, 1, 2, 3, 4, 5, 99, 98, 97, 5, 6, 7, 30, 31, 32, 11, 12, 956]

This outputs:

[[(0, 1), (0, 1)], [[[None, (6, 10)], [(8, 11), (12, 15)]], [(13, 14), (17, 18)]]]

The second list is nested because there was more than one recursion level (your previous examples immediately fell on a base case).

The explanation shown before applies recursively to each list: the second list in [[(0, 1), (0, 1)], [[[None, (6, 10)], [(8, 11), (12, 15)]], [(13, 14), (17, 18)]]] shows the differences in the lists to the right of the longest common subsequence.

The longest common subsequence is [1, 2, 3, 4, 5]. To the left of [1, 2, 3, 4, 5], both lists are different in the first element (the ranges are equal and easy to check).

Now, the procedure applies recursively. For the right side, there is a new recursive call, and src and dst become:

src = [6, 7, 8, 9, 10, 11, 12, 13]
dst = [99, 98, 97, 5, 6, 7, 30, 31, 32, 11, 12, 956]

    # LCS = [6, 7]; Call on the left
        src = []
        dst = [99, 98, 97, 5]
    # LCS = [6, 7]; Call on the right
        src = [8, 9, 10, 11, 12, 13]
        dst = [30, 31, 32, 11, 12, 956]
        # LCS = [11, 12]; Call on the left
            src = [8, 9, 10]
            dst = [30, 31, 32]
        # LCS = [11, 12]; Call on the right
            src = [13]
            dst = [956]

The longest common subsequence is [6, 7]. Then you will have another recursive call on the left, for src = [] and dst = [99, 98, 97, 5], now there is no longest common subsequence and the recursion on this side stops (just follow the picture).

Each nested list recursively represents the differences on the sub-list with which the procedure was invoked, but note that the indices always refer to positions in the original list (due to the way arguments for bx and by are passed - note that they always accumulate since the beginning).

The key point here is that you will get nested lists linearly proportional to the depth of the recursion, and in fact, you can actually tell how many common subsequences exist in the original lists just by looking at the nesting level.

Upvotes: 3

Related Questions