Reputation: 63704
How can I efficiently select a random element from a std::set?
A std::set::iterator is not a random access iterator, so I can't directly index a randomly chosen element like I could for a std::deque or std::vector.
I could take the iterator returned from std::set::begin() and increment it a random number of times in the range [0, std::set::size()), but that seems to be doing a lot of unnecessary work. For an "index" close to the set's size, I would end up traversing the entire first half of the internal tree structure, even though it's already known the element won't be found there.
Is there a better approach?
In the name of efficiency, I am willing to define "random" as less random than whatever approach I might have used to choose a random index in a vector. Call it "reasonably random".
Edit...
Many insightful answers below.
The short version is that even though you can find a specific element in log(n) time, you can't find an arbitrary element in that time through the std::set interface.
Upvotes: 12
Views: 6830
Reputation: 1733
The idea is simple: use rejection sampling while upper-bounding the rejection rate, which is achievable with an amortized O(1) compaction operation.
However, unlike solutions based on augmented trees, this approach cannot be extended to support weighted sampling.
#include <cstddef>
#include <map>
#include <unordered_map>
#include <unordered_set>
#include <utility>

template <typename T>
class UniformSamplingSet {
    size_t max_id = 0;
    std::unordered_set<size_t> unused_ids;
    std::unordered_map<size_t, T> id2value;
    std::map<T, size_t> value2id;

    // Re-number all live elements 0..n-1 so that unused ids never make up
    // more than half of [0, max_id); this bounds the rejection rate in sample().
    void compact() {
        size_t id = 0;
        std::map<T, size_t> new_value2id;
        std::unordered_map<size_t, T> new_id2value;
        for (auto& [_, value] : id2value) {
            new_value2id.emplace(value, id);
            new_id2value.emplace(id, value);
            ++id;
        }
        max_id = id;
        unused_ids.clear();
        std::swap(id2value, new_id2value);
        std::swap(value2id, new_value2id);
    }

public:
    size_t size() const {
        return id2value.size();
    }

    void insert(const T& value) {
        size_t id;
        if (!unused_ids.empty()) {
            id = *unused_ids.begin();
            unused_ids.erase(unused_ids.begin());
        } else {
            id = max_id++;
        }
        if (!value2id.emplace(value, id).second) {
            // Value already present: hand the id back.
            unused_ids.insert(id);
        } else {
            id2value.emplace(id, value);
        }
    }

    void erase(const T& value) {
        auto it = value2id.find(value);
        if (it == value2id.end()) return;
        unused_ids.insert(it->second);
        id2value.erase(it->second);
        value2id.erase(it);
        if (unused_ids.size() * 2 > max_id) {
            compact();
        }
    }

    // uniform(n): uniform random in [0, n). Precondition: the set is non-empty.
    template <typename F>
    T sample(F&& uniform) {
        size_t i;
        do { i = uniform(max_id); } while (unused_ids.find(i) != unused_ids.end());
        return id2value.at(i);
    }
};
Upvotes: 0
Reputation: 93690
What about a predicate for find (or lower_bound) which causes a random tree traversal? You'd have to tell it the size of the set so it could estimate the height of the tree and sometimes terminate before leaf nodes.
Edit: I realized the problem with this is that std::lower_bound takes a predicate but does not have any tree-like behavior (internally it uses std::advance, which is discussed in the comments of another answer). std::set<>::lower_bound uses the predicate of the set, which cannot be random and still have set-like behavior.
Aha, you can't use a different predicate, but you can use a mutable predicate. Since std::set passes the predicate object around by value, you must use a reference to the predicate (RandomPredicate<T> &) as the comparator type so you can reach in and modify it (setting it to "randomize" mode).
Here's a quasi-working example. Unfortunately I can't wrap my brain around the right random predicate so my randomness is not excellent, but I'm sure someone can figure that out:
#include <iostream>
#include <set>
#include <stdlib.h>
#include <time.h>

using namespace std;

template <typename T>
struct RandomPredicate {
    RandomPredicate() : size(0), randomize(false) { }
    bool operator () (const T& a, const T& b) {
        if (!randomize)
            return a < b;           // normal ordering while building the set
        int r = rand();
        if (size == 0)
            return false;
        else if (r % size == 0) {   // occasionally stop the descent early
            size = 0;
            return false;
        } else {
            size /= 2;              // roughly one random bit per tree level
            return r & 1;
        }
    }
    size_t size;
    bool randomize;
};

int main()
{
    srand(time(0));
    RandomPredicate<int> pred;
    set<int, RandomPredicate<int> &> s(pred);
    for (int i = 0; i < 100; ++i)
        s.insert(i);
    pred.randomize = true;
    for (int i = 0; i < 100; ++i) {
        pred.size = s.size();
        set<int, RandomPredicate<int> &>::iterator it = s.lower_bound(0);
        cout << *it << endl;
    }
}
My half-baked randomness test is ./demo | sort -u | wc -l to see how many unique integers I get out. With a larger sample set, try ./demo | sort | uniq -c | sort -n to look for unwanted patterns.
Upvotes: 4
Reputation: 26878
You can do this by maintaining a normal array of values: when you insert into the set, you append the element to the end of the array (O(1)), and when you want a random element you can grab it from the array in O(1) as well.
The issue comes when you want to remove elements from the array. The most naive method would take O(n), which might be efficient enough for your needs. However, this can be improved to O(log n) using the following method:
Keep, for each index i in the array, prfx[i], which represents the number of non-deleted elements in the range 0...i of the array. Keep a segment tree, where each node stores the maximum prfx[i] contained in its range.
Updating the segment tree can be done in O(log n) per deletion. Now, when you want to access the random number, you query the segment tree to find the "real" index of the number (by finding the earliest range in which the maximum prfx is equal to the random index). This makes the random-number generation of complexity O(log n).
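A minimal sketch of this array-plus-segment-tree idea (the names are mine, not the answer's). For simplicity each tree node stores the count of non-deleted elements in its range rather than the maximum prefix sum; finding the k-th surviving element is then the same O(log n) descent. Capacity is fixed, must be a power of two, and erasure is by slot rather than value (a value-to-slot map would be added in a real implementation):
#include <cstddef>
#include <random>
#include <vector>

class IndexedBag {
    std::vector<int> values;          // slot -> value, slots are never reused
    std::vector<std::size_t> cnt;     // segment tree of "alive" counts
    std::size_t cap;

public:
    explicit IndexedBag(std::size_t capacity)
        : cnt(2 * capacity, 0), cap(capacity) {}

    std::size_t size() const { return cnt[1]; }  // root holds the total alive count

    // Append a value and mark its leaf alive on the leaf-to-root path.
    std::size_t insert(int value) {
        std::size_t slot = values.size();
        values.push_back(value);
        for (std::size_t i = slot + cap; i >= 1; i /= 2) ++cnt[i];
        return slot;
    }

    // Lazy deletion: decrement the alive counts along the leaf-to-root path.
    void erase(std::size_t slot) {
        for (std::size_t i = slot + cap; i >= 1; i /= 2) --cnt[i];
    }

    // k-th (0-based) alive element, found by descending the tree in O(log n).
    int kth(std::size_t k) const {
        std::size_t node = 1;
        while (node < cap) {
            std::size_t left = 2 * node;
            if (cnt[left] > k) node = left;
            else { k -= cnt[left]; node = left + 1; }
        }
        return values[node - cap];
    }
};

int main() {
    IndexedBag bag(16);
    for (int v : {3, 1, 4, 1, 5, 9}) bag.insert(v);
    bag.erase(2);                     // drop the slot that holds 4

    std::mt19937 rng(std::random_device{}());
    std::uniform_int_distribution<std::size_t> pick(0, bag.size() - 1);
    int x = bag.kth(pick(rng));       // uniformly random surviving element
    (void)x;
}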
Upvotes: 1
Reputation: 103693
Use boost::container::flat_set instead:
boost::container::flat_set<int> set;
// ...
auto it = set.begin() + rand() % set.size();
Insertions and deletions become O(N), though; I don't know if that's a problem. You still have O(log N) lookups, and the fact that the container is contiguous gives an overall improvement that often outweighs the loss of O(log N) insertions and deletions.
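For reference, a complete compilable sketch of the above (assuming Boost.Container is available; the squared values are just filler data):
#include <boost/container/flat_set.hpp>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main() {
    std::srand(std::time(0));
    boost::container::flat_set<int> set;
    for (int i = 0; i < 100; ++i)
        set.insert(i * i);

    // flat_set stores its elements contiguously, so its iterators are
    // random access and can be offset directly.
    auto it = set.begin() + std::rand() % set.size();
    std::cout << *it << '\n';
}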
Upvotes: 8
Reputation: 96233
If either the set doesn't update frequently or you don't need to run this algorithm frequently, keep a mirrored copy of the data in a vector (or just copy the set to a vector on need) and randomly select from that.
Another approach, as seen in a comment, is to keep a vector of iterators into the set (they're only invalidated on element deletion for sets) and randomly select an iterator.
Finally, if you don't need a tree-based set, you could use vector or deque as your underlying container and sort/unique-ify when needed.
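A minimal sketch of the copy-on-need variant (the function name and element type are my own placeholders); the copy is O(n), so it only pays off when sampling is far more frequent than modification:
#include <random>
#include <set>
#include <vector>

// Assumes s is non-empty.
int random_element(const std::set<int>& s, std::mt19937& rng) {
    // Mirror the set into a contiguous container, then index it directly.
    std::vector<int> mirror(s.begin(), s.end());
    std::uniform_int_distribution<std::size_t> pick(0, mirror.size() - 1);
    return mirror[pick(rng)];
}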
Upvotes: 1
Reputation: 26878
You can use the std::advance method:
set<int> myset;
// insert some elements into myset
int rnd = rand() % myset.size();
set<int>::const_iterator it(myset.begin());
advance(it, rnd);
// now 'it' points to your random element
Another way to do this, probably less random:
int mini = *myset.begin(), maxi = *myset.rbegin();
int rnd = rand() % (maxi - mini + 1) + mini;
int rndresult = *myset.lower_bound(rnd);
Upvotes: 1
Reputation: 157314
If you could access the underlying red-black tree (assuming that one exists) then you could access a random node in O(log n), choosing L/R as the successive bits of a ceil(log2(n))-bit random integer. However, you can't, as the underlying data structure is not exposed by the standard.
Xeo's solution of placing iterators in a vector is O(n) time and space to set up, but amortized constant per sample thereafter. This compares favourably to std::next, which is O(n) time per sample.
Upvotes: 2