Reputation: 1292
I have a sample of about 60000 data points, and in an iterative algorithm, in each step, depending on some criterion, I either 'remove' one of those data points (set it to NaN) or 'add' one of the previously removed data points back into the sample (set it back to its original value). To prevent the algorithm from falling into an infinite loop, the sample should be different in every iteration. I therefore keep track of the data points that are currently removed and store their element indices in a list, one state per iteration, as follows:
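For example (illustrative values, chosen to be consistent with the iterations referenced below), the history might look like:

data_state_list = [
    [2],        # iteration 1: data point 2 removed
    [2, 3],     # iteration 2: data point 3 removed as well
    [2, 3, 1],  # iteration 3: data point 1 removed as well
    # ... one entry per iteration ...
]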
Now, in the current iteration 7, the algorithm suggests removing the element with array index 4, so the new state data_state_temp would be [2,1,3]. Currently it checks whether this state has been seen before via
flag_cycle = (data_state_temp in data_state_list)
The algorithm keeps proposing new states, adding or removing different array elements, until flag_cycle is False, and then proceeds.
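In sketch form (propose_new_state is a hypothetical stand-in for the actual add/remove criterion):

import random

def propose_new_state(state):
    # hypothetical criterion: randomly either re-add a removed index
    # or remove another one
    candidate = list(state)
    if candidate and random.random() < 0.5:
        candidate.pop(random.randrange(len(candidate)))
    else:
        candidate.append(random.randrange(60000))
    return candidate

data_state_list = [[2], [2, 3], [2, 3, 1]]
data_state = data_state_list[-1]

flag_cycle = True
while flag_cycle:
    data_state_temp = propose_new_state(data_state)
    flag_cycle = (data_state_temp in data_state_list)
data_state_list.append(data_state_temp)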
Apart from the fact that this does not yet fully work -- the states [2,1,3] from iteration 7 and [2,3,1] from iteration 3 are the same, but the lists compare as unequal, so I would need to sort them (or better, insert newly removed array indices where they belong in a sorted list) -- the problem is that the algorithm becomes very slow. In practice, data_state_temp can have length 15000, and data_state_list can hold 40000 lists of generally increasing length, up to 15000.
Questions: Is there a faster way to perform this check? And when data_state_temp in data_state_list is evaluated, does it compare list elements only for lists whose length matches that of data_state_temp (I would expect that), or would we need to manually select these lists beforehand?
Upvotes: 0
Views: 146
Reputation: 1330
If you do not care about the order of past states -- just whether a state was ever visited before -- then let's amass a set of all past states. This gives us an average constant-time lookup instead of the linear scan performed by in (i.e. for needle in haystack).
For a set to work its magic, it has to hold hashable types. In short, a tuple is immutable, therefore hashable, and therefore a shoo-in candidate to be used with a set.
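A quick demonstration of why the states must be tuples rather than lists:

print(hash((1, 2, 3)))   # tuples are hashable, so this works
try:
    {[1, 2, 3]}          # ...but a set cannot hold a list
except TypeError as err:
    print(err)           # unhashable type: 'list'

With that in place, the history becomes a set of sorted tuples: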
from itertools import chain

# let's say we already have a state that is 15000 long
data_states = set()
data_state = tuple(range(15000))
data_states.add(data_state)

# we loop until we decide on a new, previously unseen state
while data_state in data_states:
    # either you decide to remove an element, say 42
    # (we can skip sorting because the previous tuple was already sorted)
    new_data_state = tuple(x for x in data_state if x != 42)
    # or, alternatively, you decide to add an element, say 9000
    new_data_state = tuple(sorted(x for x in chain(data_state, [9000])))
    data_state = new_data_state

# now commit this state to your history
data_states.add(new_data_state)
Note that a tuple is immutable, so we have to sort before creating the tuple. Also note that the sorted(_) form creates a new sorted list, whereas _.sort() sorts in place. Since we consume the previous state through a generator, the in-place sort is not possible (there is no list in memory to sort), so the former is used, and its result is fed into the tuple() constructor.
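A minimal illustration of that difference:

gen = (x for x in (3, 1, 2))
print(sorted(gen))   # sorted() accepts any iterable, generators included,
                     # and returns a new list: [1, 2, 3]

lst = [3, 1, 2]
lst.sort()           # list.sort() only exists on lists and sorts in place
print(lst)           # [1, 2, 3]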
Given len(state) = n:
new_data_state = tuple(sorted(x for x in chain(data_state, [9000])))
1. x for x in ...: generator yielding n elements
2. sorted(...): creates a list and populates it with n entries
3. tuple(...): creates a tuple and populates it with n entries
Steps 1+2 happen in one pass over n. Then step 3 makes another pass over n. Total runtime complexity is O(2n) = O(n) per iteration of state.
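To see the lookup-side difference that motivated the set in the first place, here is a rough benchmark sketch (sizes scaled down; the absolute numbers are machine-dependent and illustrative only):

import timeit

# 2000 past states of increasing length, stored as lists vs. a set of tuples
history_list = [list(range(n)) for n in range(2000)]
history_set = {tuple(range(n)) for n in range(2000)}
needle = list(range(1999))  # worst case: matches the last stored state

print(timeit.timeit(lambda: needle in history_list, number=100))        # linear scan
print(timeit.timeit(lambda: tuple(needle) in history_set, number=100))  # hash lookup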
What I listed above. If you become strapped for memory, you might consider more compact forms of representing state.
For example, you might split state into 3 chunks of 5000 consecutive elements each, then use a nested dictionary keyed by those chunks, e.g. data_states[first_chunk][second_chunk] == <set of all remaining suffixes>.
In this way the space cost is amortized: if a state of 15777 elements shares its first two chunks with states already stored, the incremental space cost is only 15777 - 5000 - 5000 = 5777 elements. This is essentially a coarse trie built out of dictionaries.
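A sketch of what that could look like (the chunking scheme and helper names here are illustrative, not a fixed recipe):

CHUNK = 5000

def add_state(data_states, state):
    # split the sorted state tuple into two fixed-size chunks plus a remainder
    c1, c2, rest = state[:CHUNK], state[CHUNK:2 * CHUNK], state[2 * CHUNK:]
    data_states.setdefault(c1, {}).setdefault(c2, set()).add(rest)

def seen(data_states, state):
    c1, c2, rest = state[:CHUNK], state[CHUNK:2 * CHUNK], state[2 * CHUNK:]
    return rest in data_states.get(c1, {}).get(c2, set())

data_states = {}
add_state(data_states, tuple(range(15777)))
print(seen(data_states, tuple(range(15777))))   # True
print(seen(data_states, tuple(range(15000))))   # False: only its 5000-element
                                                # remainder would be new storage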
Yes, each candidate comparison performs length matching as its first check, short-circuiting to False on a mismatch. But no, in still walks through every single list in data_state_list.
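You can observe the length short-circuit directly (illustrative snippet):

class Loud:
    def __eq__(self, other):
        print('element comparison')
        return True

a = [Loud(), Loud(), Loud()]
b = [Loud(), Loud()]
print(a == b)   # False -- lengths differ, nothing is printed

c = [Loud(), Loud(), Loud()]
print(a == c)   # True -- equal lengths, element comparisons run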
Upvotes: 2