Reputation: 3721
I am experimenting with unrolling a few nested loops for (potentially) better performance at the expense of memory. In my scenario, I would end up with a list of about 300M elements (tuples), which I'd have to yield in (more or less) random order.
At this order of magnitude, random.shuffle(some_list) really is not the way to go anymore.
The example below illustrates the issue. Be aware that on x86_64 Linux with CPython 3.6.4, it will eat about 11 GB of memory.
def get_random_element():
    some_long_list = list(range(0, 300000000))
    for random_item in some_long_list:
        yield random_item
My thinking so far is to simply generate one random index per iteration and yield randomly picked elements (indefinitely) from the list. It may yield certain elements multiple times and totally skip others, which would be a trade-off worth considering.
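A minimal sketch of that with-replacement idea (the function name is just illustrative, not part of my actual code) could look like this:

import random

def yield_with_replacement(data):
    # Pick a random index on every iteration; some elements may repeat,
    # others may never be yielded at all.
    n = len(data)
    while True:
        yield data[random.randrange(n)]

The generator itself only needs O(1) extra memory on top of the list, since no copy or shuffle is made.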
What other options do I have within reasonable bounds of memory and CPU time to possibly yield every element of the list only once?
Upvotes: 11
Views: 886
Reputation: 20080
Here is Fisher-Yates-Knuth in-place sampling (https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle).
Memory usage was stable at ~4 GB (yes, I was using 100000000 elements).
# Fisher-Yates-Knuth sampling, in-place Durstenfeld version
import numpy as np

def swap(data, posA, posB):
    if posA != posB:
        data[posB], data[posA] = data[posA], data[posB]

def get_random_element(data, datalen):
    pos = datalen
    while pos > 0:
        idx = np.random.randint(low=0, high=pos)  # sample in the [0...pos) range
        pos -= 1
        swap(data, idx, pos)
        yield data[pos]

length = 100000000
some_long_list = list(range(0, length))

gen = get_random_element(some_long_list, length)

for k in range(0, length):
    print(next(gen))
UPDATE
For speed, you might want to inline swap() as well.
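A sketch of that inlined variant (the function name is illustrative; the logic is the same as above, with the swap() call folded into the loop):

import numpy as np

def get_random_element_inlined(data, datalen):
    # Durstenfeld shuffle with swap() inlined to avoid the per-element function call
    pos = datalen
    while pos > 0:
        idx = np.random.randint(low=0, high=pos)  # sample in the [0...pos) range
        pos -= 1
        if idx != pos:
            data[idx], data[pos] = data[pos], data[idx]
        yield data[pos]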
Upvotes: 4