Reputation: 4766

Find specific sublists in a list

Let's say that we have the following list:

sequence = ['2', '4', '1', '2', '3', '4', '2', '4', '2', '4', '4']
#indices     0    1    2    3    4    5    6    7    8    9    10

Next, we have the following list:

key_list = ['2', '2', '4']

Now, I want to extract all the possible sub-lists from sequence that preserve the order of keylist, i.e. its indices.

Let me explain by example. So, for sequence, all the possible sub-lists of indices which preserve the order of key_list are:

[0, 3, 5]
[0, 3, 7]
[0, 3, 9]
[0, 3, 10]

[0, 6, 7]
[0, 6, 9]
[0, 6, 10]

[0, 8, 9]
[0, 8, 10]

[3, 6, 7]
[3, 6, 9]
[3, 6, 10]

[3, 8, 9]
[3, 8, 10]

[6, 8, 9]
[6, 8, 10]

Any suggestions?

EDIT: I am working with a big dataset and I have to perform this for every line of the file, so I am looking for a very optimal way to do this, by avoiding brute-force approach (making all possible combinations of the sequence)

P.S. I don't know if the title of the question is appropriate, feel free to change it if you have a better one.

Upvotes: 3

Answers (6)

Paddy3118

Reputation: 4772

This second answer of mine finds the indices of all the keys in the sequence once then uses a recursive iterator (Python 2.x/3.x so I didn't use yield from, to find the possible index combinations:

from collections import defaultdict

sequence = ['2', '4', '1', '2', '3', '4', '2', '4', '2', '4', '4']
key_list = ['2', '2', '4']
keys = set(key_list)

key_indices = defaultdict(list)
for indx, val in enumerate(sequence):
    if val in keys:
        key_indices[val].append(indx)
print('key_indices =', key_indices)

def expander(keysleft, indices, sofar=None):
    #print('  keysleft, sofar =', keysleft, sofar )
    if sofar is None :
        sofar = []
    if sofar == []:
        indxleft = -1
    else:
        indxleft = sofar[-1]
    if keysleft:
        keyval, keyrest = keysleft[0], keysleft[1:]
        for keyindx in indices[keyval]:
            if keyindx > indxleft:
                if not keyrest:
                    # No more to do so
                    yield tuple(sofar + [keyindx])
                else:
                    for x in expander(keyrest, indices, sofar + [keyindx]):
                        yield x

ans = list(expander(key_list, dict(key_indices)))
print(ans)

Output:

key_indices = defaultdict(<class 'list'>, {'4': [1, 5, 7, 9, 10], '2': [0, 3, 6, 8]})
[(0, 3, 5), (0, 3, 7), (0, 3, 9), (0, 3, 10), (0, 6, 7), (0, 6, 9), (0, 6, 10), (0, 8, 9), (0, 8, 10), (3, 6, 7), (3, 6, 9), (3, 6, 10), (3, 8, 9), (3, 8, 10), (6, 8, 9), (6, 8, 10)]

Upvotes: 1

j_random_hacker

Reputation: 51226

This is the Longest Common Subsequence problem in a thin disguise. The only differences from the usual formulation are that you want the positions, rather than the characters themselves, and that you assume that the key_list sequence appears in its entirety as a subsequence of sequence, while the LCS problem doesn't make this assumption.

The LCS problem, which is closely related to the problem of optimally aligning two sequences (e.g. DNA sequences), can be solved in O(n^2) time using the Needleman-Wunsch dynamic programming algorithm, but that only gives you one solution; in the worst case, it could take exponentially long to enumerate all of them (consider the case of looking for a list of k 1s in a list of 2k 1s for large k; there are (2k choose k) answers). That said, it's just as easy to get the positions as the characters from the DP matrix, and it's straightforward to enumerate all solutions instead of a single one: as you backtrack through the DP matrix, whenever you encounter a cell for which two or all three in-edges have (equal) maximum score (instead of just one in-edge being the unique maximum), recurse to process all of them instead of picking an arbitrary one.

Incidentally, if key_list doesn't appear as a subsequence anywhere in sequence then the LCS algorithm will find the location of all "nearest" matches -- those positions at which the fewest characters are missing. This may or may not be useful to you.

Upvotes: 1

Paddy3118

Reputation: 4772

This expands on Sebastiens answer by recognising that you don't need any of the members of sequence that are not in key_list (now a key_tuple), as long as you preserve the original indices of what is left:

>>> from itertools import combinations
>>> sequence = ['2', '4', '1', '2', '3', '4', '2', '4', '2', '4', '4']
>>> key_tuple = ('2', '2', '4')
>>> keys = set(key_tuple)
>>> seq = [(indx, val) for indx, val in enumerate(sequence) if val in keys]
>>> seq
[(0, '2'), (1, '4'), (3, '2'), (5, '4'), (6, '2'), (7, '4'), (8, '2'), (9, '4'), (10, '4')]
>>> answer = []
>>> for c in combinations(seq, len(key_tuple)):
...     indxs, vals = zip(*c)
...     if vals == key_tuple:
...         answer.append(indxs)
... 
>>> answer
[(0, 3, 5), (0, 3, 7), (0, 3, 9), (0, 3, 10), (0, 6, 7), (0, 6, 9), (0, 6, 10), (0, 8, 9), (0, 8, 10), (3, 6, 7), (3, 6, 9), (3, 6, 10), (3, 8, 9), (3, 8, 10), (6, 8, 9), (6, 8, 10)]
>>>

Upvotes: 1

koffein

Reputation: 1882

Here is a recursive approach.

I look up every index of the first key. Then I use the same function to look up the following keys and join all the indices...

def indexLists(sequence, key_list, seq_start=0, key_start=0):
     """
         seq_start - where I start looking up in sequence
         key_start - which key I am looking up: key = key_list[key_start]
     """
     keyIndexes = []

     # I look up all indices of key_list[key_start] that are higher than seq_start
     while True:

         try:
             keyIndexes.append(
                  sequence.index(
                     key_list[key_start],# what I want to look up
                     keyIndexes[-1]+1 if keyIndexes else seq_start # starting after the last entry or seq_start
                  )
              )
         except:
             break # if there is an error, the are no more indices

     # if there are more entries in key_list
     if key_start+1 < len(key_list):
         # I look up the possible indexes of the following key(s) and combine them
         return [(keyIndex,)+nextKeys  for keyIndex in keyIndexes for nextKeys in indexLists(sequence, key_list, keyIndex+1, key_start+1)]
     else:
         # for the last key in key_list i just return all possible keyIndexes as 1-tuples
         return [(keyIndex, ) for keyIndex in keyIndexes]

Example:

sequence = ['2', '4', '1', '2', '3', '4', '2', '4', '2', '4', '4']
key_list = ['2', '2', '4']

indexLists(sequence, key_list)
Out[37]: 
[(0, 3, 5),
 (0, 3, 7),
 (0, 3, 9),
 (0, 3, 10),
 (0, 6, 7),
 (0, 6, 9),
 (0, 6, 10),
 (0, 8, 9),
 (0, 8, 10),
 (3, 6, 7),
 (3, 6, 9),
 (3, 6, 10),
 (3, 8, 9),
 (3, 8, 10),
 (6, 8, 9),
 (6, 8, 10)]

Upvotes: 1

Seb D.

Reputation: 5195

It probably needs some optimization and maybe a better structure than a list of lists to avoid the stupid copying and inserting that I'm doing right now, but I think this should do the trick in worst complexity len(sequence)^2 (not sure about the complexity though).

sequence = ['2', '4', '1', '2', '3', '4', '2', '4', '2', '4', '4']
key_list = ['2', '2', '4']

sub_lists = []
final_sub_lists = set()
len_key_list = len(key_list)

for index, value in enumerate(sequence):
    for sub_list in sub_lists:
        len_sub_list = len(sub_list)
        # Test if current value can continue the current sub list
        if len_sub_list < len_key_list and key_list[len_sub_list] == value:
            if len_sub_list == len_key_list - 1:
                # We have found a complete sub list
                final_sub_lists.add(tuple(sub_list + [index]))
            else:
                # We copy the current sub list to be sure not miss any sub lists
                # like for instance (6, 8, 9) and (6, 8, 10).
                sub_lists.insert(0, sub_list[:])
                sub_list.append(index)
    if key_list[0] == value:
        # Start a new sub list
        sub_lists.append([index])

print sorted(final_sub_lists)

Explanation:sub_lists is a list of list containing the indexes that have matched so far. When a sub_list has matched all values of key_list, it's appended to the set final_sub_lists.

It's not fully tested, so feel free to correct or point out optimizations!

Upvotes: 1

Ashwini Chaudhary

Reputation: 250961

You can use itertools.combinations for this. Apply combinations() on enumerate(sequence)(with r=len(key_list)) to get all r-length combinations from the list, and since enumerate() returns both index as well as item we can easily get the indices here:

>>> from itertools import combinations               
>>> for c in combinations(enumerate(sequence), len(key_list)):
    indices, data = zip(*c)
    if list(data) == key_list:
        print indices
...         
(0, 3, 5)
(0, 3, 7)
(0, 3, 9)
(0, 3, 10)
(0, 6, 7)
(0, 6, 9)
(0, 6, 10)
(0, 8, 9)
(0, 8, 10)
(3, 6, 7)
(3, 6, 9)
(3, 6, 10)
(3, 8, 9)
(3, 8, 10)
(6, 8, 9)
(6, 8, 10)

Upvotes: 6

Find specific sublists in a list

Answers (6)

Related Questions