Reputation: 4766
Let's say that we have the following list:
sequence = ['2', '4', '1', '2', '3', '4', '2', '4', '2', '4', '4']
#indices 0 1 2 3 4 5 6 7 8 9 10
Next, we have the following list:
key_list = ['2', '2', '4']
Now, I want to extract all the possible sub-lists from sequence
that preserve the order of keylist
, i.e. its indices.
Let me explain by example. So, for sequence
, all the possible sub-lists of indices which preserve the order of key_list
are:
[0, 3, 5]
[0, 3, 7]
[0, 3, 9]
[0, 3, 10]
[0, 6, 7]
[0, 6, 9]
[0, 6, 10]
[0, 8, 9]
[0, 8, 10]
[3, 6, 7]
[3, 6, 9]
[3, 6, 10]
[3, 8, 9]
[3, 8, 10]
[6, 8, 9]
[6, 8, 10]
Any suggestions?
EDIT: I am working with a big dataset and I have to perform this for every line of the file, so I am looking for a very optimal way to do this, by avoiding brute-force approach (making all possible combinations of the sequence)
P.S. I don't know if the title of the question is appropriate, feel free to change it if you have a better one.
Upvotes: 3
Views: 194
Reputation: 4772
This second answer of mine finds the indices of all the keys in the sequence once then uses a recursive iterator (Python 2.x/3.x so I didn't use yield from
, to find the possible index combinations:
from collections import defaultdict
sequence = ['2', '4', '1', '2', '3', '4', '2', '4', '2', '4', '4']
key_list = ['2', '2', '4']
keys = set(key_list)
key_indices = defaultdict(list)
for indx, val in enumerate(sequence):
if val in keys:
key_indices[val].append(indx)
print('key_indices =', key_indices)
def expander(keysleft, indices, sofar=None):
#print(' keysleft, sofar =', keysleft, sofar )
if sofar is None :
sofar = []
if sofar == []:
indxleft = -1
else:
indxleft = sofar[-1]
if keysleft:
keyval, keyrest = keysleft[0], keysleft[1:]
for keyindx in indices[keyval]:
if keyindx > indxleft:
if not keyrest:
# No more to do so
yield tuple(sofar + [keyindx])
else:
for x in expander(keyrest, indices, sofar + [keyindx]):
yield x
ans = list(expander(key_list, dict(key_indices)))
print(ans)
Output:
key_indices = defaultdict(<class 'list'>, {'4': [1, 5, 7, 9, 10], '2': [0, 3, 6, 8]})
[(0, 3, 5), (0, 3, 7), (0, 3, 9), (0, 3, 10), (0, 6, 7), (0, 6, 9), (0, 6, 10), (0, 8, 9), (0, 8, 10), (3, 6, 7), (3, 6, 9), (3, 6, 10), (3, 8, 9), (3, 8, 10), (6, 8, 9), (6, 8, 10)]
Upvotes: 1
Reputation: 51226
This is the Longest Common Subsequence problem in a thin disguise. The only differences from the usual formulation are that you want the positions, rather than the characters themselves, and that you assume that the key_list
sequence appears in its entirety as a subsequence of sequence
, while the LCS problem doesn't make this assumption.
The LCS problem, which is closely related to the problem of optimally aligning two sequences (e.g. DNA sequences), can be solved in O(n^2) time using the Needleman-Wunsch dynamic programming algorithm, but that only gives you one solution; in the worst case, it could take exponentially long to enumerate all of them (consider the case of looking for a list of k 1s in a list of 2k 1s for large k; there are (2k choose k) answers). That said, it's just as easy to get the positions as the characters from the DP matrix, and it's straightforward to enumerate all solutions instead of a single one: as you backtrack through the DP matrix, whenever you encounter a cell for which two or all three in-edges have (equal) maximum score (instead of just one in-edge being the unique maximum), recurse to process all of them instead of picking an arbitrary one.
Incidentally, if key_list
doesn't appear as a subsequence anywhere in sequence
then the LCS algorithm will find the location of all "nearest" matches -- those positions at which the fewest characters are missing. This may or may not be useful to you.
Upvotes: 1
Reputation: 4772
This expands on Sebastiens answer by recognising that you don't need any of the members of sequence that are not in key_list (now a key_tuple
), as long as you preserve the original indices of what is left:
>>> from itertools import combinations
>>> sequence = ['2', '4', '1', '2', '3', '4', '2', '4', '2', '4', '4']
>>> key_tuple = ('2', '2', '4')
>>> keys = set(key_tuple)
>>> seq = [(indx, val) for indx, val in enumerate(sequence) if val in keys]
>>> seq
[(0, '2'), (1, '4'), (3, '2'), (5, '4'), (6, '2'), (7, '4'), (8, '2'), (9, '4'), (10, '4')]
>>> answer = []
>>> for c in combinations(seq, len(key_tuple)):
... indxs, vals = zip(*c)
... if vals == key_tuple:
... answer.append(indxs)
...
>>> answer
[(0, 3, 5), (0, 3, 7), (0, 3, 9), (0, 3, 10), (0, 6, 7), (0, 6, 9), (0, 6, 10), (0, 8, 9), (0, 8, 10), (3, 6, 7), (3, 6, 9), (3, 6, 10), (3, 8, 9), (3, 8, 10), (6, 8, 9), (6, 8, 10)]
>>>
Upvotes: 1
Reputation: 1882
Here is a recursive approach.
I look up every index of the first key. Then I use the same function to look up the following keys and join all the indices...
def indexLists(sequence, key_list, seq_start=0, key_start=0):
"""
seq_start - where I start looking up in sequence
key_start - which key I am looking up: key = key_list[key_start]
"""
keyIndexes = []
# I look up all indices of key_list[key_start] that are higher than seq_start
while True:
try:
keyIndexes.append(
sequence.index(
key_list[key_start],# what I want to look up
keyIndexes[-1]+1 if keyIndexes else seq_start # starting after the last entry or seq_start
)
)
except:
break # if there is an error, the are no more indices
# if there are more entries in key_list
if key_start+1 < len(key_list):
# I look up the possible indexes of the following key(s) and combine them
return [(keyIndex,)+nextKeys for keyIndex in keyIndexes for nextKeys in indexLists(sequence, key_list, keyIndex+1, key_start+1)]
else:
# for the last key in key_list i just return all possible keyIndexes as 1-tuples
return [(keyIndex, ) for keyIndex in keyIndexes]
Example:
sequence = ['2', '4', '1', '2', '3', '4', '2', '4', '2', '4', '4']
key_list = ['2', '2', '4']
indexLists(sequence, key_list)
Out[37]:
[(0, 3, 5),
(0, 3, 7),
(0, 3, 9),
(0, 3, 10),
(0, 6, 7),
(0, 6, 9),
(0, 6, 10),
(0, 8, 9),
(0, 8, 10),
(3, 6, 7),
(3, 6, 9),
(3, 6, 10),
(3, 8, 9),
(3, 8, 10),
(6, 8, 9),
(6, 8, 10)]
Upvotes: 1
Reputation: 5195
It probably needs some optimization and maybe a better structure than a list of lists to avoid the stupid copying and inserting that I'm doing right now, but I think this should do the trick in worst complexity len(sequence)^2
(not sure about the complexity though).
sequence = ['2', '4', '1', '2', '3', '4', '2', '4', '2', '4', '4']
key_list = ['2', '2', '4']
sub_lists = []
final_sub_lists = set()
len_key_list = len(key_list)
for index, value in enumerate(sequence):
for sub_list in sub_lists:
len_sub_list = len(sub_list)
# Test if current value can continue the current sub list
if len_sub_list < len_key_list and key_list[len_sub_list] == value:
if len_sub_list == len_key_list - 1:
# We have found a complete sub list
final_sub_lists.add(tuple(sub_list + [index]))
else:
# We copy the current sub list to be sure not miss any sub lists
# like for instance (6, 8, 9) and (6, 8, 10).
sub_lists.insert(0, sub_list[:])
sub_list.append(index)
if key_list[0] == value:
# Start a new sub list
sub_lists.append([index])
print sorted(final_sub_lists)
Explanation:sub_lists
is a list of list containing the indexes that have matched so far. When a sub_list
has matched all values of key_list
, it's appended to the set final_sub_lists
.
It's not fully tested, so feel free to correct or point out optimizations!
Upvotes: 1
Reputation: 250961
You can use itertools.combinations
for this. Apply combinations()
on enumerate(sequence)
(with r=len(key_list)
) to get all r-length combinations from the list, and since enumerate()
returns both index as well as item we can easily get the indices here:
>>> from itertools import combinations
>>> for c in combinations(enumerate(sequence), len(key_list)):
indices, data = zip(*c)
if list(data) == key_list:
print indices
...
(0, 3, 5)
(0, 3, 7)
(0, 3, 9)
(0, 3, 10)
(0, 6, 7)
(0, 6, 9)
(0, 6, 10)
(0, 8, 9)
(0, 8, 10)
(3, 6, 7)
(3, 6, 9)
(3, 6, 10)
(3, 8, 9)
(3, 8, 10)
(6, 8, 9)
(6, 8, 10)
Upvotes: 6