prooffreader
prooffreader

Reputation: 2453

Find overlap of two lists, preserving sequence order

I've found many methods of finding list intersections here, but I'm having trouble finding an efficient way to find the intersection when order is taken into account.

list1 = [1, 2, 3, 4, 5, 6, 7]
list2 = [7, 6, 3, 4, 5, 8]

The function should return [3, 4, 5]

I would already know there is only one overlapping sequence, and I would know its minimum length, but not its exact length.

Upvotes: 1

Views: 1414

Answers (1)

Martijn Pieters
Martijn Pieters

Reputation: 1125208

You are looking for the Longest Common Subsequence algorithm; the following uses dynamic programming to find the elements in O(NM) time (for sequences of length N and M):

def lcs(a, b):
    tbl = [[0 for _ in range(len(b) + 1)] for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            tbl[i + 1][j + 1] = tbl[i][j] + 1 if x == y else max(
                tbl[i + 1][j], tbl[i][j + 1])
    res = []
    i, j = len(a), len(b)
    while i and j:
        if tbl[i][j] == tbl[i - 1][j]:
            i -= 1
        elif tbl[i][j] == tbl[i][j - 1]:
            j -= 1
        else:
            res.append(a[i - 1])
            i -= 1
            j -= 1
    return res[::-1]

Demo:

>>> def lcs(a, b):
...     tbl = [[0 for _ in range(len(b) + 1)] for _ in range(len(a) + 1)]
...     for i, x in enumerate(a):
...         for j, y in enumerate(b):
...             tbl[i + 1][j + 1] = tbl[i][j] + 1 if x == y else max(
...                 tbl[i + 1][j], tbl[i][j + 1])
...     res = []
...     i, j = len(a), len(b)
...     while i and j:
...         if tbl[i][j] == tbl[i - 1][j]:
...             i -= 1
...         elif tbl[i][j] == tbl[i][j - 1]:
...             j -= 1
...         else:
...             res.append(a[i - 1])
...             i -= 1
...             j -= 1
...     return res[::-1]
... 
>>> list1 = [1, 2, 3, 4, 5, 6, 7]
>>> list2 = [7, 6, 3, 4, 5, 8]
>>> lcs(list1, list2)
[3, 4, 5]

This will find the subsequence regardless of location and if other elements are mixed in between:

>>> lcs([1, 2, 3, 4, 5, 6, 7], [7, 3, 6, 4, 8, 5])
[3, 4, 5]

Upvotes: 3

Related Questions