What is the fastest way to check for existence and get random element from data structure in python?

Question

I need a data structure (preferably a built in python type) that allows for existence check in O(1) and also random element choosing in O(1). The elements to store in the structure are unique integers. I have considered this options:

list: O(1) random choice with random.choice, but O(n) existence with in operator.
set: O(1) existence check with in operator, but can't use random.choice without casting to tuple (O(n))
tuple: O(1) random choice with random.choice, but O(n) check for existence
dictionary: O(1) existence check , but memory use is at least 2 times as much compared to the other options (keys and values) , and random.choice does not work without casting to list or getting items (assumably O(n) operations).

Do I need to implement a compound data structure? What is the best approach? PD: I am using Python 3 by the way.

EDIT:

The data structure is created from a list comprehension with a range generator, so the data inside the sequence is always ordered. Something like this:

data=[x for x in range(n) if condition]

kaya3 · Accepted Answer

There are a few approaches for this. It depends how important it is to minimise memory usage. From the comments you've said it's not necessary to be able to add or remove elements.

Implicit set

For the special case where you know both the range that the numbers are drawn from, and the condition for them to be included in the set, and you also only need to do small number of membership and sampling operations, then you don't need to build any data structure at all. You can use the condition to test membership, and the range to sample from it:

import random

# example usage:
# >>> s = ImplicitSet(range(10), lambda x: x > 4)

class ImplicitSet:
    def __init__(self, range, condition):
        self.range = range
        self.condition = condition
    def contains(self, x):
        return x in self.range and self.condition(x)
    def sample(self):
        # you'll need to change this if the condition might never be true
        while True:
            x = random.choice(range)
            if self.condition(x):
                return x

Unfortunately you can't just generate one random number and iterate until you find a member of the set, since that would bias the sampling towards numbers with bigger "gaps" before them.

This takes O(1) memory, because you aren't storing any actual data structure. The contains operation takes whatever time it takes to test the condition, but if you're only testing membership for a small number of elements, it saves time compared to a list comprehension which tests the condition for every element of the range.

The sample operation requires a number of tries inversely proportional to the density of elements satisfying the condition within the range, so e.g. if the condition is true for 25% of the elements then you need four tries on average. So sampling takes constant time, but the size of the constant depends on the distribution of the actual numbers. Again, if you're only sampling a few times then this is much better than a list comprehension having to test the condition for every number in the range.

List and set

This way is the simplest to implement, and gives both membership tests (using the set) and random sampling (using the list) in constant time. However, it stores each integer in two containers instead of one, using more memory.

import random

class ListAndSet:
    def __init__(self, numbers):
        self.numbers_list = list(numbers)
        self.numbers_set = set(self.numbers_list) # numbers may be a generator
    def contains(self, x):
        return x in self.numbers_set
    def sample(self):
        return random.choice(self.numbers_list)

Sorted list

This one is also pretty simple to implement, and it uses the least possible memory unless your numbers are within a fixed size, in which case a sorted array is better. It takes O(n log n) time to sort the numbers in the first place, but after that, membership tests take O(log n) time using a binary search, and sampling takes O(1) time.

import bisect
import random

class SortedList:
    def __init__(self, numbers):
        self.sorted_list = sorted(numbers)
    def contains(self, x):
        index = bisect.bisect_left(self.sorted_list, x)
        return index < len(self.sorted_list) and self.sorted_list[index] == x
    def sample(self):
        return random.choice(self.sorted_list)

Dense set

If the numbers are densely packed within some range, you could store them in a set, and then sample by generating random numbers in that range until you find one that's in the set. The expected number of tries is inversely proportional to the density of the numbers within the range, so e.g. if you have 1,000 numbers within a range of size 4,000, you will need four tries on average.

import random

class DenseSet:
    def __init__(self, numbers):
        self.numbers_set = set(numbers)
        # you'll need to change this if the set might sometimes be empty
        self.low = min(self.numbers_set)
        self.high = max(self.numbers_set)
    def contains(self, x):
        return x in self.numbers_set
    def sample(self):
        # you'll need to change this if the set might sometimes be empty
        while True:
            x = random.randint(self.low, self.high)
            if x in self.numbers_set:
                return x

Your own hash-set

If O(log n) time isn't good enough, and the numbers are sparse, but you absolutely can't afford the memory required for both a list and a set, then you could implement your own hash-set; you can find example code by searching online. To sample in O(1) expected time, repeatedly generate random indices in the underlying list until you find a slot that is non-empty. This guarantees a uniform distribution of the samples. The expected number of tries is inversely proportional to the hash-set's load factor, so e.g. with a load factor of 0.8 you would need on average 1.25 tries.

Unfortunately you can't just generate one index and then iterate through until you find a non-empty slot, since the sampling would be biased towards elements with more empty slots behind them.