mgallagher

Reputation: 466

Randomly Sample a Python Set without converting to a list

Question

I have spent a good amount of time reading various answers about getting a random sample in Python, and random.sample seems to be the natural and most common choice. However, I am attempting to sample from a Python set object and was hoping to do it efficiently.

I am using a set because of Python's very nice and efficient set operations (intersection, difference, etc.). For my purposes sets are a very efficient data structure, and lists specifically are not. I have an algorithm situation where I have N elements in a set and potentially need to take up to N subsamples of arbitrary size for each sampling of the set. Each subsample is not drawn from exactly the same set; it is defined by the properties of the element I am generating a subsample for. Here is some vague code which demonstrates the shape of the algorithm:

main_set = set(...) # Values sourced from elsewhere.
capacity = 20

for element in elements: # Iterate over the objects whose sets need filling
    potential_values = main_set - element.set # Exclude values already in element
    sample_size = capacity - len(element.set) # Num needed to fill the set to capacity
    new_vals = sample(potential_values, sample_size) # <- insert sampling idea here

    element.set = element.set | set(new_vals) # Union of sample and element set

From what I have gathered online and in some tests, random.sample appears to convert a set into a list object first. The size of main_set - element.set (potential_values) is almost always much greater than the size of element.set, so if potential_values must be converted into a list on every iteration, the algorithm's performance will suffer immensely.
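
For anyone who wants to measure the conversion cost themselves, here is a rough timing sketch (mine, not from the original post; the set size and repetition count are arbitrary assumptions). It compares paying for list() on every call against sampling from a list built once up front:

import random
import timeit

main_set = set(range(1_000_000))   # arbitrary example size
main_list = list(main_set)         # one-off conversion, paid once

# Converting the set on every call (newer Pythons require an explicit
# sequence for random.sample anyway, so the conversion cost is visible):
per_call = timeit.timeit(lambda: random.sample(list(main_set), 20), number=100)

# Sampling from a list that was built once, outside the hot loop:
pre_built = timeit.timeit(lambda: random.sample(main_list, 20), number=100)

print(per_call, pre_built)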

So does anyone have any advice or ideas on how to go about doing this efficiently with sets? I appreciate any input on this matter, and before anyone jumps to the 'premature optimization' routine: I have a very good idea of the scale at which this will be executing, and the difference between O(n) and O(n^2) is pretty substantial.


Clarification Edit:

I specifically do not care about the output type of whatever sample() method is provided. The actual samples I am pulling from potential_values are small compared to the size of potential_values. Rather, all of the suggested sample() methods require a list-like input to work, meaning potential_values must be converted to an indexable type first, which is what I wanted to avoid.

Also, I realize now that I brought up big-O notation in a really vague way and probably shouldn't have. When I said I wanted to avoid O(n^2), I really meant I wanted to avoid adding another O(n) operation inside the loop. As was pointed out to me, main_set - element.set has the same time complexity as list(main_set), so the loop is already O(n^2). Adding the list conversion makes the whole algorithm more like O(2n^2), but none of that is really important.

Upvotes: 4

Views: 3407

Answers (4)

Dima Tisnek

Reputation: 11779

Depending on your definition of random.

Just some elements, I don't care which:

[s.copy().pop() for i in range(count)]  # with replacement

copy = s.copy()
[copy.pop() for i in range(count)]  # without replacement

Elements with decent [pseudo-]random distribution:

copy = list(s)
random.sample(copy, count)

Repeatable pseudo-random distribution:

copy = sorted(s)
# random.seed(...)
random.sample(copy, count)

Repeatable pseudo-random, hypothetically with less runtime overhead:

heapq.nlargest(...)  # per Jon or Marius

Discussion:

  • set.pop() already removes and returns an arbitrary element; however, it is quite predictable when the objects' hash values are the same each time, e.g. when it is the same set of numbers every run (see the sketch after this list). It may be acceptable if the set is different each time
  • set.copy() is O(N)
  • sorted() / list.sort() are O(N log N) amortised, and you will likely pay the full cost since the set's iteration order is scrambled by hashing
  • heapq.nlargest could be O(N) using a median-of-medians selection, but the Python implementation uses a binary heap of constant size n (the sample size), making it O(N*log(n)) as the N elements are filtered through a heap sieve of size n. Note that using key= adds a notable linear overhead, thus O(C*N*log(n)); your domain will determine whether C*log(n) is smaller or larger than log(N)
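
A tiny sketch of the predictability point from the first bullet (my own illustration, using small integers, whose hash is the value itself): on CPython, popping from a fresh copy tends to return the same element every time, and successive pops walk the table order rather than giving a shuffled sample.

s = set(range(100))

# "With replacement": every fresh copy has the same hash-table layout,
# so pop() tends to return the identical element on each call.
print([s.copy().pop() for i in range(5)])   # likely [0, 0, 0, 0, 0]

# "Without replacement": successive pops follow table order, not a shuffle.
copy = s.copy()
print([copy.pop() for i in range(5)])       # likely [0, 1, 2, 3, 4]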

Upvotes: 0

mgallagher

Reputation: 466

As @user2357112 suggested, here is a rejection-sampling version of the code in my original question. It efficiently samples n elements from the source set, given that I am only sampling values from main_set that are not already in element.set.

import random

main_set = set(...) # Values sourced from elsewhere.
capacity = 20
listed_set = list(main_set) # Initially convert the set to a list, once, so we can sample

for element in elements:
    while len(element.set) < capacity:
        item = random.choice(listed_set)
        element.set.add(item) # Sets cannot contain duplicates, no conditional required

Although this doesn't answer the question of how to sample directly from a set in Python, it does efficiently solve what my algorithm is attempting to do. If, after a while, no one comes up with either a way to sample directly from a set or something more efficient than this, I will probably mark this as the answer. Thanks for the idea, @user2357112!


As @LieRyan pointed out, if element.set overlaps main_set by a large percentage, this algorithm will repeatedly fail to draw a non-overlapping item from random.choice(). So if we are expecting a high overlap, such as 50%, then simply computing the unique items between the two sets with main_set - element.set and converting that to a list will be much faster than this method. Essentially, this algorithm is for the case where element.set overlaps very little with main_set, as a percentage of main_set.
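
For completeness, here is a hedged sketch of that hybrid idea (the helper name, the 0.5 threshold, and the min/max guard are my own illustrative assumptions, not measured values): use rejection sampling when the overlap is small, and fall back to the set-difference-plus-list approach when it is large.

import random

def fill_to_capacity(element_set, main_set, listed_set, capacity, overlap_threshold=0.5):
    # listed_set is list(main_set), built once outside the loop.
    overlap = len(element_set & main_set) / len(main_set)
    if overlap < overlap_threshold:
        # Low overlap: collisions are rare, so rejection sampling is cheap.
        while len(element_set) < capacity:
            element_set.add(random.choice(listed_set))
    else:
        # High overlap: pay for one set difference and one list conversion.
        potential_values = list(main_set - element_set)
        needed = max(0, min(capacity - len(element_set), len(potential_values)))
        element_set.update(random.sample(potential_values, needed))
    return element_set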

Upvotes: 0

Marius

Reputation: 60160

A quick attempt at timing in IPython suggests that using heapq.nlargest isn't necessarily better than your existing method; adjust for the characteristics of your actual data as appropriate:

import random
import heapq

set_size = 100000
sample_size = 1000

def sample_heapq(your_set, sample_size):
    sample = heapq.nlargest(sample_size, your_set, key = lambda e: random.random())
    return sample

def sample_original(your_set, sample_size):
    sample = random.sample(your_set, sample_size)
    return sample

eg_set = set(range(set_size))

Running these through timeit:

%timeit sample_heapq(eg_set, sample_size)
1000 loops, best of 3: 523 µs per loop

%timeit sample_original(eg_set, sample_size)
1000 loops, best of 3: 479 µs per loop

Upvotes: 1

Jon Clements

Reputation: 142216

You can use heapq.nlargest, which can take any iterable, and provide it with a random key to pick by, e.g.:

import random, heapq

sample = heapq.nlargest(sample_size, your_set, key=lambda L: random.random())

Note - this will give you a list object back, so you'll need to convert it if necessary...
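
For example, converting back to a set if you need set operations afterwards (the input set here is just an assumption for illustration):

import random, heapq

your_set = set(range(1000))   # example data
sample_size = 10

sample = heapq.nlargest(sample_size, your_set, key=lambda L: random.random())
sample_set = set(sample)      # back to a set for further set arithmetic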

Upvotes: 1
