Juan Imbett
Juan Imbett

Reputation: 187

Efficient way to extract and collect a random subsample of a generator in Julia

Consider a generator in Julia that if collected will take a lot of memory

g=(x^2 for x=1:9999999999999999)

I want to take a random small subsample (Say 1%) of it, but I do not want to collect() the object because will take a lot of memory

Until now the trick I was using was this

temp=collect((( rand()>0.01 ? nothing : x ) for x in g))
random_sample= temp[temp.!=nothing]

But this is not efficient for generators with a lot of elements, collecting something with so many nothing elements doesnt seem right

Any idea is highly appreciated. I guess the trick is to be able to get random elements from the generator without having to allocate memory for all of it.

Thank you very much

Upvotes: 2

Views: 161

Answers (1)

Bogumił Kamiński
Bogumił Kamiński

Reputation: 69869

You can use a generator with if condition like this:

[v for v in g if rand() < 0.01]

or if you want a bit faster, but more verbose approach (I have hardcoded 0.01 and element type of g and I assume that your generator supports length - otherwise you can remove sizehint! line):

function collect_sample(g)
    r = Int[]
    sizehint!(r, round(Int, length(g) * 0.01))
    for v in g
        if rand() < 0.01
           push!(r, v)
        end
    end
    r
end

EDIT

Here you have examples of self avoiding sampler and reservoir sampler giving you fixed output size. The smaller fraction of the input you want to get the better it is to use self avoiding sampler:

function self_avoiding_sampler(source_size, ith, target_size)
    rng = 1:source_size
    idx = rand(rng)
    x1 = ith(idx)
    r = Vector{typeof(x1)}(undef, target_size)
    r[1] = x1
    s = Set{Int}(idx)
    sizehint!(s, target_size)
    for i = 2:target_size
        while idx in s
            idx = rand(rng)
        end
        @inbounds r[i] = ith(idx)
        push!(s, idx)
    end
    r
end

function reservoir_sampler(g, target_size)
    r = Vector{Int}(undef, target_size)
    for (i, v) in enumerate(g)
        if i <= target_size
            @inbounds r[i] = v
        else
            j = rand(1:i)
            if j < target_size
                @inbounds r[j] = v
            end
        end
    end
    r
end

Upvotes: 2

Related Questions