Reputation: 4510
Python's random module has random.choices for sampling with replacement and random.sample for sampling without replacement. Although random.choices accepts a numpy array and returns a list of randomly selected elements with respect to the first dimension, random.sample raises
TypeError: Population must be a sequence or set. For dicts, use list(d).
On the other hand, random.choices will not accept sets, raising
TypeError: 'set' object does not support indexing.
What I'm curious about is whether this is an oversight or whether there's an essential reason for restricting random.sample to sequences and sets while random.choices is restricted to objects supporting indexing, despite the two functions having very similar purposes.
P.S. If anyone is wondering how to sample an ndarray: numpy.random.choice samples 1-D arrays both with and without replacement, and higher-dimensional arrays can be sampled along any dimension with advanced indexing, where the indices for that dimension are generated with numpy.random.choice.
Upvotes: 2
Views: 3811
Reputation: 280778
random.sample tries to check whether its argument is an instance of collections.abc.Sequence or collections.abc.Set. This is a much less reliable check than many people believe, since it only detects types that concretely inherit from those ABCs or that are explicitly registered. numpy.ndarray doesn't inherit from those classes and isn't registered.
Without the check, or if you explicitly do collections.abc.Sequence.register(numpy.ndarray), random.sample handles numpy.ndarray fine.
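To illustrate the ABC check without depending on NumPy, here is a minimal sketch using a hypothetical ArrayLike class (my own stand-in, not anything from NumPy or the standard library). It is sized and indexable but does not inherit from Sequence, which is the same situation numpy.ndarray is in. The exact error text varies between Python versions, so the sketch only checks for TypeError.

```python
import random
from collections.abc import Sequence


class ArrayLike:
    """Hypothetical container: sized and indexable, but not a Sequence subclass."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        return self._items[index]


data = ArrayLike(range(10))

# random.choices only needs len() and indexing, so it accepts the object.
picked = random.choices(data, k=3)

# random.sample runs an isinstance check against the ABCs first and rejects it.
try:
    random.sample(data, 3)
    rejected = False
except TypeError:
    rejected = True

# Registering the class as a virtual subclass satisfies the check.
Sequence.register(ArrayLike)
sampled = random.sample(data, 3)
```

After registration, random.sample treats the object like any other sequence, which is what Sequence.register(numpy.ndarray) would do for arrays.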
Incidentally, numpy.random.choice with replace=False is absurdly inefficient, generating an entire permutation of the input just to take a small sample. It's a longstanding issue that hasn't been fixed because the natural fix would change the results for people relying on seed for reproducible output. As of NumPy 1.17, you should instead use the new Generator API:
rng = numpy.random.default_rng()
result = rng.choice(input, size=whatever, replace=False)
The Generator API is not bound by backward compatibility guarantees with the old API, so the NumPy developers were free to change the algorithm. If you're stuck on an old NumPy, then depending on the parameters, it is often faster to use random.sample, or to compute the sample manually, than to use numpy.random.choice with replace=False.
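One way to use random.sample here, sketched under the assumption that the data is indexable along its first dimension: sample indices from range(len(...)), which random.sample accepts without building a full permutation, then look the rows up. The sample_rows helper and the table variable are hypothetical names for illustration, and a plain list of lists stands in for a 2-D array.

```python
import random


def sample_rows(rows, k):
    """Return k distinct rows by sampling indices without replacement."""
    # range objects are sequences, so random.sample accepts them directly
    # and does not need to materialize every index for a small sample.
    indices = random.sample(range(len(rows)), k)
    return [rows[i] for i in indices]


# Stand-in for a 2-D array: 1000 rows of [i, i squared].
table = [[i, i * i] for i in range(1000)]
subset = sample_rows(table, 5)
```

With an actual ndarray, the sampled index list could be passed to the array as an index instead of using the list comprehension.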
Upvotes: 3