BatWannaBe

Reputation: 4510

Why can't random.sample handle numpy arrays when random.choices can?

Python's random module has random.choices for sampling with replacement and random.sample for sampling without replacement. random.choices accepts a numpy array and returns a list of elements selected at random along the first dimension, but random.sample raises

TypeError: Population must be a sequence or set. For dicts, use list(d).

On the other hand, random.choices will not accept sets, raising

TypeError: 'set' object does not support indexing.
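The set case can be reproduced with the standard library alone. This is only a small sketch: the variable names are made up for illustration, and the exact wording of the error message differs between Python versions, so the code checks only for TypeError.

```python
import random

population_set = {10, 20, 30, 40}

# random.choices picks elements by position (population[i]),
# which a set does not support.
try:
    random.choices(population_set, k=2)
    choices_error = None
except TypeError as exc:
    choices_error = exc

# Converting the set to an indexable sequence first avoids the error.
picked = random.choices(sorted(population_set), k=2)
```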

What I'm curious about is whether this is an oversight, or whether there's an essential reason for restricting random.sample to sequences and sets while random.choices is restricted to objects that support indexing, given that the two functions have very similar purposes.

P.S. If anyone is wondering how to sample an ndarray: numpy.random.choice samples 1-D arrays both with and without replacement, and higher-dimensional arrays can be sampled along any dimension with advanced indexing, where the indices for that dimension are generated with numpy.random.choice.
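For illustration, here is a minimal sketch of that approach. The array shape and sample sizes are arbitrary assumptions, and it assumes NumPy is installed.

```python
import numpy as np

arr = np.arange(20).reshape(4, 5)

# 1-D case: sample 3 distinct elements from a single row.
row_sample = np.random.choice(arr[0], size=3, replace=False)

# 2-D case: draw 2 distinct column indices, then use them to
# index along axis 1 (advanced indexing).
col_idx = np.random.choice(arr.shape[1], size=2, replace=False)
col_sample = arr[:, col_idx]
```

Passing an integer n as the first argument makes numpy.random.choice sample from the indices 0 to n-1, which is what makes the index-then-select pattern work for any axis.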

Upvotes: 2

Views: 3811

Answers (1)

user2357112

Reputation: 280778

random.sample tries to check whether its argument is an instance of collections.abc.Sequence or collections.abc.Set. This is a much less reliable check than many people believe, since it only detects types that concretely inherit from those ABCs or that are explicitly registered. numpy.ndarray doesn't inherit from those classes and isn't registered.

Without the check, or if you explicitly do collections.abc.Sequence.register(numpy.ndarray), random.sample handles numpy.ndarray fine.
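The effect of the check can be reproduced without NumPy. In the sketch below, ArrayLike is a hypothetical stand-in for an ndarray: it supports len() and indexing but does not inherit from collections.abc.Sequence. The class and variable names are made up for illustration.

```python
import collections.abc
import random


class ArrayLike:
    """Hypothetical stand-in: sized and indexable, but not a Sequence subclass."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        return self._items[index]


data = ArrayLike(range(10))

# The ABC check fails even though len() and indexing both work.
is_sequence_before = isinstance(data, collections.abc.Sequence)

try:
    random.sample(data, 3)
    sample_error = None
except TypeError as exc:
    sample_error = exc

# Explicit registration makes the isinstance check pass,
# after which random.sample accepts the object.
collections.abc.Sequence.register(ArrayLike)
is_sequence_after = isinstance(data, collections.abc.Sequence)
sample = random.sample(data, 3)
```

Note that register() changes the result of isinstance checks for the whole process, so registering numpy.ndarray this way affects any other code that tests for Sequence.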

Incidentally, numpy.random.choice with replace=False is absurdly inefficient, generating an entire permutation of the input just to take a small sample. It's a longstanding issue that hasn't been fixed because the natural fix would change the results for people relying on a fixed seed. As of NumPy 1.17, you should instead use the new Generator API:

rng = numpy.random.default_rng()
result = rng.choice(input, size=whatever, replace=False)

The Generator API is not bound by backward compatibility guarantees with the old API, so the NumPy developers were free to change the algorithm. If you're stuck on an old NumPy, then depending on the parameters, it is often faster to use random.sample, or to compute the sample manually, than to use numpy.random.choice with replace=False.
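As a rough sketch of both options, the following draws a small sample from a large array, first with the Generator API and then with random.sample over positions. The array size and sample size are arbitrary assumptions, and the first half needs NumPy 1.17 or later.

```python
import random

import numpy as np

arr = np.arange(100_000)
k = 5

# Generator API (NumPy 1.17+).
rng = np.random.default_rng()
gen_sample = rng.choice(arr, size=k, replace=False)

# Alternative for older NumPy: sample k distinct positions with the
# standard library, then index the array with them.
positions = random.sample(range(len(arr)), k)
manual_sample = arr[positions]
```

Sampling from range(len(arr)) sidesteps the Sequence check entirely, since range is a registered Sequence.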

Upvotes: 3
