Reputation: 466
I have spent a good amount of time reading various answers about getting a random sample in Python, and random.sample seems to be the natural and most common choice. However, I am attempting to sample from a Python set object and was hoping to do it efficiently.
I am using a set due to the very nice and efficient set functionality in Python (intersections, difference, etc.). For my purposes sets are a very efficient data structure, and lists are specifically not. I have an algorithm situation where I have N elements in a set, and potentially need to take up to N subsamples of an arbitrary size for each sampling of the set. Each subsampling of the set is not of the exact same set, and is defined by the properties of each element I must generate a subsample of. Here is some vague code which demonstrates the complexity of the algorithm:
main_set = set(...)   # Values sourced from elsewhere.
capacity = 20
for element in elements:   # 'elements' stands in for the real collection of element objects
    potential_values = main_set - element.set            # Exclude values already in element
    sample_size = capacity - len(element.set)            # Num needed to fill the set to capacity
    new_vals = sample(potential_values, sample_size)     # <- insert sampling idea here
    element.set = element.set | set(new_vals)            # Union of sample and element set
From what I have gathered online and in some tests, random.sample appears to convert a set into a list object. The size of main_set - element.set (i.e. potential_values) is almost always much greater than the size of element.set, so if potential_values must be converted to a list on each sampling, the algorithm's performance will suffer immensely.
So does anyone have any advice or ideas on how to go about doing this efficiently with sets? I appreciate any input on this matter, and before anyone jumps to the 'premature optimization' routine, I have a very good idea of the scale on which this is going to be executing and the difference between O(n) and O(n^2) is pretty substantial.
I specifically do not care about the output of whatever sample() method is provided. The actual samples I am pulling from potential_values are small compared to the size of potential_values. Rather, all of the suggested sample() methods require a list-like input to work, meaning potential_values must be converted to an indexable type first, which is what I wanted to avoid.
Also, I realize now that I brought up big-O notation in a really vague way and probably shouldn't have. When I said I wanted to avoid O(n^2), I really meant I wanted to avoid adding another O(n) operation inside the loop. As was pointed out to me, main_set - element.set has the same time complexity as list(main_set), so the loop is already O(n^2). Adding the list conversion makes the whole algorithm more like O(2n^2), but none of that is really important.
Upvotes: 4
Views: 3407
Reputation: 11779
Depending on your definition of random.
Just some elements, I don't care which:
[s.copy().pop() for i in range(count)] # with replacement
copy = s.copy()
[copy.pop() for i in range(count)] # without replacement
Elements with decent [pseudo-]random distribution:
copy = list(s)
random.sample(copy, count)
Repeatable pseudo-random distribution:
copy = sorted(s)
# random.seed(...)
random.sample(copy, count)
Repeatable pseudo-random, hypothetically with less runtime overhead:
heapq.nlargest(...) # per Jon or Marius
Discussion:
- set.pop() already removes and returns an arbitrary element; however, it is quite predictable if the objects' hash values are the same each run, e.g. if it is the same set of numbers every time. It may be acceptable if the set is different each time (see the sketch after this list).
- set.copy() is O(N).
- sorted() / list.sort() are O(N log N) amortised; likely so, as a set is randomised by hash.
- heapq.nlargest could be O(N) per Median of Medians; the Python implementation is a binary heap of constant size, making it O(N*log(n)), as the N elements are filtered through a heap sieve of size n. Note that using key= adds a notable linear overhead, thus O(C*N*log(n)); your domain will determine whether C*log(n) <?> log(N).
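For concreteness, a minimal sketch of the pop-based options above (reusing the s and count names from the snippets; the data is purely illustrative, and the predictability of set.pop() is a CPython implementation detail rather than a language guarantee):

import random

s = set(range(100))   # illustrative data
count = 5

# "Just some elements": popping from a throwaway copy each time is fast but predictable,
# since every fresh copy tends to yield the same element first.
print([s.copy().pop() for i in range(count)])

# Without replacement: pop repeatedly from a single copy.
copy = s.copy()
print([copy.pop() for i in range(count)])

# Decent pseudo-random distribution: convert once, then sample.
print(random.sample(list(s), count))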
Upvotes: 0
Reputation: 466
As @user2357112 suggested, here is a rejection-sampling version of the code in my original question that efficiently samples n elements from the source set, given that I am only sampling values from main_set that are not already in element.set.
import random

main_set = set(...)           # Values sourced from elsewhere.
capacity = 20
listed_set = list(main_set)   # Initially convert the set to a list so we can sample from it
for element in elements:      # 'elements' stands in for the real collection of element objects
    while len(element.set) < capacity:
        item = random.choice(listed_set)
        element.set.add(item)   # Sets cannot contain duplicates, no conditional required
Although this doesn't answer the question of how to sample directly from a set in Python, it does efficiently solve what my algorithm is attempting to do. If, after a while, no one comes up with either an idea to sample directly from a set or something more efficient than this, I will probably mark this as the answer. Thanks for the idea @user2357112!
As @LieRyan pointed out, if element.set overlaps with main_set by a large percentage, this algorithm will spend many iterations on random.choice() calls that return items already in element.set. Therefore if we are expecting a high overlap, such as perhaps 50%, then simply getting the unique items between the two sets with main_set - element.set and converting that to a list will be much faster than this method (see the sketch below). Essentially this algorithm is for the case when main_set has very little overlap with element.set as a percentage of main_set.
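A minimal sketch of that fallback, assuming the same names as my pseudocode above (elements is a hypothetical stand-in for the real collection of element objects):

import random

for element in elements:
    # One O(len(main_set)) set difference and list conversion per element,
    # but no rejection loop afterwards.
    potential_values = list(main_set - element.set)
    sample_size = capacity - len(element.set)
    element.set.update(random.sample(potential_values, sample_size))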
Upvotes: 0
Reputation: 60160
A quick attempt at timing in IPython suggests that using heapq.nlargest isn't necessarily better than your existing method; adjust to the characteristics of your actual data as appropriate:
import random
import heapq
set_size = 100000
sample_size = 1000
def sample_heapq(your_set, sample_size):
    sample = heapq.nlargest(sample_size, your_set, key=lambda e: random.random())
    return sample

def sample_original(your_set, sample_size):
    sample = random.sample(your_set, sample_size)
    return sample
eg_set = set(range(sample_size))   # note: set_size is defined above but unused; set(range(set_size)) may have been intended
Running these through timeit:
%timeit sample_heapq(eg_set, sample_size)
1000 loops, best of 3: 523 µs per loop
%timeit sample_original(eg_set, sample_size)
1000 loops, best of 3: 479 µs per loop
Upvotes: 1
Reputation: 142216
You can use heapq.nlargest, which can take any iterable, and provide it with a random key to pick, e.g.:
import random, heapq
sample = heapq.nlargest(sample_size, your_set, key=lambda L: random.random())
Note - this will give you a list object back, so you'll need to convert it if necessary...
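For instance, a small self-contained usage sketch (your_set and sample_size here are just illustrative values):

import random, heapq

your_set = set(range(100))   # illustrative data
sample_size = 10

sample = heapq.nlargest(sample_size, your_set, key=lambda L: random.random())
sample_set = set(sample)     # convert back to a set if the surrounding code expects one
print(sample_set)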
Upvotes: 1