Reputation: 460
I need a data structure (preferably a built in python type) that allows for existence check in O(1) and also random element choosing in O(1). The elements to store in the structure are unique integers. I have considered this options:
Do I need to implement a compound data structure? What is the best approach? PD: I am using Python 3 by the way.
EDIT:
The data structure is created from a list comprehension with a range generator, so the data inside the sequence is always ordered. Something like this:
data=[x for x in range(n) if condition]
Upvotes: 0
Views: 1181
Reputation: 51063
There are a few approaches for this. It depends how important it is to minimise memory usage. From the comments you've said it's not necessary to be able to add or remove elements.
For the special case where you know both the range that the numbers are drawn from, and the condition for them to be included in the set, and you also only need to do small number of membership and sampling operations, then you don't need to build any data structure at all. You can use the condition to test membership, and the range to sample from it:
import random
# example usage:
# >>> s = ImplicitSet(range(10), lambda x: x > 4)
class ImplicitSet:
def __init__(self, range, condition):
self.range = range
self.condition = condition
def contains(self, x):
return x in self.range and self.condition(x)
def sample(self):
# you'll need to change this if the condition might never be true
while True:
x = random.choice(range)
if self.condition(x):
return x
Unfortunately you can't just generate one random number and iterate until you find a member of the set, since that would bias the sampling towards numbers with bigger "gaps" before them.
This takes O(1) memory, because you aren't storing any actual data structure. The contains
operation takes whatever time it takes to test the condition, but if you're only testing membership for a small number of elements, it saves time compared to a list comprehension which tests the condition for every element of the range.
The sample
operation requires a number of tries inversely proportional to the density of elements satisfying the condition within the range, so e.g. if the condition is true for 25% of the elements then you need four tries on average. So sampling takes constant time, but the size of the constant depends on the distribution of the actual numbers. Again, if you're only sampling a few times then this is much better than a list comprehension having to test the condition for every number in the range.
This way is the simplest to implement, and gives both membership tests (using the set) and random sampling (using the list) in constant time. However, it stores each integer in two containers instead of one, using more memory.
import random
class ListAndSet:
def __init__(self, numbers):
self.numbers_list = list(numbers)
self.numbers_set = set(self.numbers_list) # numbers may be a generator
def contains(self, x):
return x in self.numbers_set
def sample(self):
return random.choice(self.numbers_list)
This one is also pretty simple to implement, and it uses the least possible memory unless your numbers are within a fixed size, in which case a sorted array
is better. It takes O(n log n) time to sort the numbers in the first place, but after that, membership tests take O(log n) time using a binary search, and sampling takes O(1) time.
import bisect
import random
class SortedList:
def __init__(self, numbers):
self.sorted_list = sorted(numbers)
def contains(self, x):
index = bisect.bisect_left(self.sorted_list, x)
return index < len(self.sorted_list) and self.sorted_list[index] == x
def sample(self):
return random.choice(self.sorted_list)
If the numbers are densely packed within some range, you could store them in a set, and then sample by generating random numbers in that range until you find one that's in the set. The expected number of tries is inversely proportional to the density of the numbers within the range, so e.g. if you have 1,000 numbers within a range of size 4,000, you will need four tries on average.
import random
class DenseSet:
def __init__(self, numbers):
self.numbers_set = set(numbers)
# you'll need to change this if the set might sometimes be empty
self.low = min(self.numbers_set)
self.high = max(self.numbers_set)
def contains(self, x):
return x in self.numbers_set
def sample(self):
# you'll need to change this if the set might sometimes be empty
while True:
x = random.randint(self.low, self.high)
if x in self.numbers_set:
return x
If O(log n) time isn't good enough, and the numbers are sparse, but you absolutely can't afford the memory required for both a list and a set, then you could implement your own hash-set; you can find example code by searching online. To sample in O(1) expected time, repeatedly generate random indices in the underlying list until you find a slot that is non-empty. This guarantees a uniform distribution of the samples. The expected number of tries is inversely proportional to the hash-set's load factor, so e.g. with a load factor of 0.8 you would need on average 1.25 tries.
Unfortunately you can't just generate one index and then iterate through until you find a non-empty slot, since the sampling would be biased towards elements with more empty slots behind them.
Upvotes: 2
Reputation: 432
Unless you have strong memory constraints, I would go with a list and a set, that you keep synchronized. You check with the set for existence but pick at random from the list.
class RandomChoice:
def __init__(self, data):
self.list = list(data)
self.set = set(self.list)
def add(self, element):
if element not in self.set:
self.list.append(element)
self.set.add(element)
def contains(self, element):
return self.set.contains(element)
def get_random(self):
return random.choice(self.list)
Of course it takes twice the storage than a single list or a single set, but it does the job.
Upvotes: 1