Reputation: 3721
I am experimenting with unrolling a few nested loops for (potentially) better performance at the expense of memory. In my scenario, I would end up with a list of about 300M elements (tuples), which I'd have to yield in (more or less) random order.
At this order of magnitude, random.shuffle(some_list) really is not the way to go anymore.
The example below illustrates the issue. Be aware that on x86_64 Linux with CPython 3.6.4, it will eat about 11 GB of memory.
def get_random_element():
    some_long_list = list(range(0, 300000000))
    for random_item in some_long_list:
        yield random_item
My thinking so far is to simply generate one random index per iteration and yield randomly picked elements (indefinitely) from the list. It may yield certain elements multiple times and totally skip others, which would be a trade-off worth considering.
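A minimal sketch of that with-replacement idea (the function name is just illustrative, not part of my actual code) could look like this:

import random

def yield_with_replacement(data):
    # Pick a random index on every iteration; some elements may repeat,
    # others may never be yielded at all.
    n = len(data)
    while True:
        yield data[random.randrange(n)]

The generator itself only needs O(1) extra memory on top of the list, since no copy or shuffle is made.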
What other options do I have within reasonable bounds of memory and CPU time to possibly yield every element of the list only once?
Upvotes: 11
Views: 886
Reputation: 20080
Here is Fisher-Yates-Knuth in-place sampling (https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle).
Memory usage was stable at ~4 GB (yes, I was using 100000000 elements).
# Fisher-Yates-Knuth sampling, in-place Durstenfeld version
import numpy as np

def swap(data, posA, posB):
    if posA != posB:
        data[posB], data[posA] = data[posA], data[posB]

def get_random_element(data, datalen):
    pos = datalen
    while pos > 0:
        idx = np.random.randint(low=0, high=pos)  # sample in the [0...pos) range
        pos -= 1
        swap(data, idx, pos)
        yield data[pos]

length = 100000000
some_long_list = list(range(0, length))

gen = get_random_element(some_long_list, length)

for k in range(0, length):
    print(next(gen))
UPDATE
For speed, you might want to inline swap() as well.
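A sketch of that inlined variant (the function name is illustrative; the logic is the same as above, with the swap() call folded into the loop):

import numpy as np

def get_random_element_inlined(data, datalen):
    # Durstenfeld shuffle with swap() inlined to avoid the per-element function call
    pos = datalen
    while pos > 0:
        idx = np.random.randint(low=0, high=pos)  # sample in the [0...pos) range
        pos -= 1
        if idx != pos:
            data[idx], data[pos] = data[pos], data[idx]
        yield data[pos]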
Upvotes: 4