Reputation: 385
I am trying to sample data from a list of integers. The tricky part is that each sample should have a different size to emulate some other data I have. I am doing a for loop right now that can do the job, but I was just wondering if there are faster ways that I am not aware of.
Since I think random.sample is supposed to be fast, I am doing:
import random

result = []
for i in range(100000):
    size = list_of_sizes[i]
    result.append(random.sample(data, size))
So the result I get is something like:
>>> list_of_sizes
[3, 4, 1, 2, ...]
>>> result
[[1, 2, 3],
 [3, 6, 2, 8],
 [9],
 [10, 100],
 ...]
I have tried using np.random.choice(data, size, replace=False) and random.sample(data, k=size), but neither accepts an array of different sizes, so the operation cannot be vectorized (when np.random.choice is given an array as its size parameter, it returns a single sample reshaped to that shape, not one independent sample per entry). Ideally, I would expect something like:
>>> np.random.choice(data, list_of_sizes, replace=False)
[[1, 2, 3],
 [3, 6, 2, 8],
 [9],
 [10, 100],
 ...]
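For reference, this is the shape behaviour I mean (the actual values are random; this is just an illustration):

>>> np.random.choice(np.arange(10), size=(2, 3), replace=False)
array([[4, 0, 7],
       [2, 9, 5]])
# one sample of 6 distinct values, reshaped to (2, 3), rather than 6 separate samples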
Upvotes: 3
Views: 3252
Reputation: 36249
Depending on your hardware and data sizes, multiprocessing can give a sizeable speed-up, although this needs to be measured for your specific problem setup. For example, using multiprocessing.pool.Pool:
from functools import partial
from multiprocessing.pool import Pool
from random import sample

with Pool() as pool:
    result = pool.map(partial(sample, data), sizes)
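Depending on how many sizes there are, it may also be worth experimenting with the chunksize argument of pool.map to reduce inter-process overhead; the value below is just an arbitrary starting point to benchmark, not a recommendation:

with Pool() as pool:
    # chunksize is a tuning knob; 64 is only a guess and should be measured
    result = pool.map(partial(sample, data), sizes, chunksize=64)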
Here are some example results (using 4 CPU cores):
from functools import partial
from multiprocessing.pool import Pool
from random import choices, sample
from statistics import mean, stdev
import time

def baseline(data, sizes):
    return [sample(data, k) for k in sizes]

def multiprocessing(data, sizes):
    with Pool(4) as pool:
        return pool.map(partial(sample, data), sizes)

def timeit(f, *args, n=7):
    timings = []
    for __ in range(n):
        t_start = time.time()  # using time because of multiprocessing
        f(*args)
        t_stop = time.time()
        timings.append(t_stop - t_start)
    print(f'[{f.__name__}] {mean(timings):.2f} +/- {stdev(timings):.2f} s')

data = list(range(1_000_000))
sizes = choices(range(max(data) // 100), k=1_000)

timeit(baseline, data, sizes)
timeit(multiprocessing, data, sizes)
which gives:
[baseline] 3.19 +/- 0.07 s
[multiprocessing] 2.10 +/- 0.02 s
But again, this depends on hardware and data, so it needs to be checked on each individual system.
Upvotes: 1
Reputation: 11602
It seems that np.random.choice is indeed not optimized for sampling without replacement. However, you can get better performance by using Generator.choice, as discussed here.
I see a 14x speedup for your parameters:
import numpy as np

data = np.arange(10**6)
sample_sizes = np.random.randint(1, 70_000, 100)

def f(data, sample_sizes):
    result = []
    for s in sample_sizes:
        result.append(np.random.choice(data, s, replace=False))
    return result

def f2(data, sample_sizes):
    g = np.random.Generator(np.random.PCG64())
    n = data.shape[0]
    return [data[g.choice(n, k, replace=False)] for k in sample_sizes]

%timeit f(data, sample_sizes)
%timeit f2(data, sample_sizes)
1 loop, best of 3: 5.18 s per loop
1 loop, best of 3: 375 ms per loop
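For what it's worth, the same kind of Generator can also be constructed via np.random.default_rng(), which is the recommended entry point in recent NumPy versions and is backed by PCG64 by default, so performance should be comparable. A small sketch, not part of the timings above:

import numpy as np

g = np.random.default_rng()  # Generator backed by PCG64 by default
result = [data[g.choice(len(data), k, replace=False)] for k in sample_sizes]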
Upvotes: 3