Reputation: 1292
I have a sample of about 60000 data points, and in an iterative algorithm, in each step, depending on some criterion, I either 'remove' one of those data points (set it to NaN) or 'add' one of the previously removed data points back into the sample (set it back to its original value). To prevent the algorithm from falling into an infinite loop, the sample should be different in every iteration. I therefore keep track of the data points that are currently removed and store their element indices in a list, one state per iteration, as follows:
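For example (illustrative values, chosen to be consistent with the iterations referenced below), the history might look like:

data_state_list = [
    [2],        # iteration 1: data point 2 removed
    [2, 3],     # iteration 2: data point 3 removed as well
    [2, 3, 1],  # iteration 3: data point 1 removed as well
    # ... one entry per iteration ...
]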
Now, in the current iteration 7, the algorithm suggests removing the element with array index 4, so the new state data_state_temp would be [2,1,3]. Currently it checks whether this state has been seen before via
flag_cycle = (data_state_temp in data_state_list)
The algorithm keeps proposing new states, adding or removing different array elements, until flag_cycle is False, and then proceeds.
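In sketch form (propose_new_state is a hypothetical stand-in for the actual add/remove criterion):

import random

def propose_new_state(state):
    # hypothetical criterion: randomly either re-add a removed index
    # or remove another one
    candidate = list(state)
    if candidate and random.random() < 0.5:
        candidate.pop(random.randrange(len(candidate)))
    else:
        candidate.append(random.randrange(60000))
    return candidate

data_state_list = [[2], [2, 3], [2, 3, 1]]
data_state = data_state_list[-1]

flag_cycle = True
while flag_cycle:
    data_state_temp = propose_new_state(data_state)
    flag_cycle = (data_state_temp in data_state_list)
data_state_list.append(data_state_temp)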
Apart from the fact that this does not yet fully work -- the states [2,1,3] from iteration 7 and [2,3,1] from iteration 3 are the same, but the lists compare as unequal, so I would need to sort them (or better, insert newly removed array indices where they belong in a sorted list) -- the problem is that the algorithm becomes very slow. In practice, data_state_temp can have length 15000, and data_state_list can hold 40000 lists of generally increasing length, up to 15000.
Questions: Is there a faster way to perform this check? And when data_state_temp in data_state_list is evaluated, does it compare list elements only for lists whose length matches that of data_state_temp (I would expect that), or would we need to manually select these lists beforehand?
Upvotes: 0
Views: 146
Reputation: 1330
If you do not care about the order of past states -- just whether a state was ever visited before -- then let's amass a set of all past states. This gives us an average constant-time lookup instead of the linear scan performed by in (i.e. for needle in haystack).
For a set to work its magic, it has to hold hashable types. In short, a tuple is immutable, therefore hashable, and therefore a shoo-in candidate to be used with a set.
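A quick demonstration of why the states must be tuples rather than lists:

print(hash((1, 2, 3)))   # tuples are hashable, so this works
try:
    {[1, 2, 3]}          # ...but a set cannot hold a list
except TypeError as err:
    print(err)           # unhashable type: 'list'

With that in place, the history becomes a set of sorted tuples: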
from itertools import chain

# let's say we already have a state that is 15000 long
data_states = set()
data_state = tuple(range(15000))
data_states.add(data_state)

# we loop until we decide on a new, previously unseen state
while data_state in data_states:
    # either you decide to remove an element, say 42
    # (we can skip sorting because the previous tuple was already sorted)
    new_data_state = tuple(x for x in data_state if x != 42)
    # or, alternatively, you decide to add an element, say 9000
    new_data_state = tuple(sorted(x for x in chain(data_state, [9000])))
    data_state = new_data_state

# now commit this state to your history
data_states.add(new_data_state)
Note that a tuple is immutable, so we have to sort before creating the tuple. Also note that the sorted(_) form creates a new sorted list, whereas _.sort() sorts in place. Since we consume the previous state through a generator, the in-place sort is not possible (there is no list in memory to sort), so the former is used, and its result is fed into the tuple() constructor.
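A minimal illustration of that difference:

gen = (x for x in (3, 1, 2))
print(sorted(gen))   # sorted() accepts any iterable, generators included,
                     # and returns a new list: [1, 2, 3]

lst = [3, 1, 2]
lst.sort()           # list.sort() only exists on lists and sorts in place
print(lst)           # [1, 2, 3]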
Given len(state) = n:
new_data_state = tuple(sorted(x for x in chain(data_state, [9000])))
1. x for x in ...: generator yielding n elements
2. sorted(...): creates a list and populates it with n entries
3. tuple(...): creates a tuple and populates it with n entries
Steps 1+2 happen in one pass over n. Then step 3 makes another pass over n. Total runtime complexity is O(2n) = O(n) per iteration of state.
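To see the lookup-side difference that motivated the set in the first place, here is a rough benchmark sketch (sizes scaled down; the absolute numbers are machine-dependent and illustrative only):

import timeit

# 2000 past states of increasing length, stored as lists vs. a set of tuples
history_list = [list(range(n)) for n in range(2000)]
history_set = {tuple(range(n)) for n in range(2000)}
needle = list(range(1999))  # worst case: matches the last stored state

print(timeit.timeit(lambda: needle in history_list, number=100))        # linear scan
print(timeit.timeit(lambda: tuple(needle) in history_set, number=100))  # hash lookup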
What I listed above. If you become strapped for memory, you might consider more compact forms of representing state.
For example, you might split state into 3 chunks of 5000 consecutive elements each, then use a nested dictionary keyed by those chunks, e.g. data_states[first_chunk][second_chunk] == <set of all remaining suffixes>.
In this way the space cost is amortized: if a state of 15777 elements shares its first two chunks with states already stored, the incremental space cost is only 15777 - 5000 - 5000 = 5777 elements. This is essentially a coarse trie built out of dictionaries.
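A sketch of what that could look like (the chunking scheme and helper names here are illustrative, not a fixed recipe):

CHUNK = 5000

def add_state(data_states, state):
    # split the sorted state tuple into two fixed-size chunks plus a remainder
    c1, c2, rest = state[:CHUNK], state[CHUNK:2 * CHUNK], state[2 * CHUNK:]
    data_states.setdefault(c1, {}).setdefault(c2, set()).add(rest)

def seen(data_states, state):
    c1, c2, rest = state[:CHUNK], state[CHUNK:2 * CHUNK], state[2 * CHUNK:]
    return rest in data_states.get(c1, {}).get(c2, set())

data_states = {}
add_state(data_states, tuple(range(15777)))
print(seen(data_states, tuple(range(15777))))   # True
print(seen(data_states, tuple(range(15000))))   # False: only its 5000-element
                                                # remainder would be new storage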
Yes, each candidate comparison performs length matching as its first check, short-circuiting to False on a mismatch. But no, in still walks through every single list in data_state_list.
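You can observe the length short-circuit directly (illustrative snippet):

class Loud:
    def __eq__(self, other):
        print('element comparison')
        return True

a = [Loud(), Loud(), Loud()]
b = [Loud(), Loud()]
print(a == b)   # False -- lengths differ, nothing is printed

c = [Loud(), Loud(), Loud()]
print(a == c)   # True -- equal lengths, element comparisons run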
Upvotes: 2