efritz
efritz

Reputation: 5203

Randomly Evicting Map Values

I need to be able to evict m elements out of a map containing n elements with only access to an iterator. I could, simply, iterate the dictionary once and remove all elements with probability m/n, however this may evict more or less than m items (although the expected number of items removed is correctly m).

int m = 10;
int n = map.size();

Iterator<K> keys = map.keySet().iterator();
while (keys.hasNext()) {
    keys.next();
    if (random.nextDouble() < m / (double) n) {
        keys.remove();
    }
}

The solution I have been thinking of would be to simply stop evicting elements once m elements have been evicted, and at the end of iteration, if evicted < m elements have been evicted, evict the remaining m - evicted elements on a second iteration. I'm worried that this second pass isn't probabilistically correct.

int m = 10;
int n = size();
int evicted = 0;

outer: while (evicted < m) {

Iterator<K> keys = keySet().iterator();
while (keys.hasNext()) {
    keys.next();
    if (random.nextDouble() < m / (double) n) {
        keys.remove();

        if (++evicted == m) {
            break outer;
        }
    }
}

I could, alternatively, keep a list of keys and reservoir-sample the list with one iteration and remove all keys in the list of m keys, but there is a bit of memory overhead that I'd rather not be forced to use. Also, removing with the iterator is faster than removing an element by key (it required finding the bucket the key is stored in, and then its place in the list). Is there another online algorithm I could use with only access to an iterator (without creating a separate list)?

EDIT: For those interested, I've found a paper detailing how to generate a random distribution in order so that a separate sorting step is not required. The code is something like this (may contain duplicates when truncated to an integer):

int curmax = 1.0;
int[] indices = new int[m];
for (int i = indices.length; i >= 0; i--) {
    curmax = curmax * Math.pow(random.nextDouble(), 1 / (double) (i+1));
    indices[i] = (int) curmax;
}

Upvotes: 2

Views: 442

Answers (2)

Ben Allison
Ben Allison

Reputation: 7394

The correct way to do this is to remove each element with probability m/n, but renormalise probabilities depending on the result (if we remove an element, decrememnt m, and the current probability needs to be scaled by the number of elements left to choose from). My java is a bit rusty and I don't have access to a compiler, so excuse me if this doesn't quite work (but you should be able to fix it without too much hassle I hope):

int seen = 0

Iterator<K> keys = map.keySet().iterator();
while (keys.hasNext()) {
    if (m==0)
      break;

    keys.next();
    prob = m / (double)(n-seen)  //renormalise the prob so that the total available is 1 across all remaining instances
    if (random.nextDouble() < prob) {
        keys.remove();
        m--;
    }
    seen++;
}

I hope the logic here is clear---this is a generalisation of how to sample one element from the set with probability 1/n, where once you've rejected an element you can ignore it and consider a distribution over all remaining elements. This should ensure you return exactly m elements with the correct probabilities.

EDIT:

Fixed a couple of typos and removed redundant variable.

Upvotes: 1

Eric J.
Eric J.

Reputation: 150108

Why not just evict the first M elements and then stop the iterator? Is there some ordering reflected in the iterator that would present an unwanted bias to the evicted elements?

If yes, your two-pass approach will not be statistically flawless. If the first pass terminates early because you reached M before reaching the end of the iteration, the later elements were never considered for eviction.

If you reach the end of the iteration without having evicted M elements, the first elements of the iteration will "risk" eviction twice, while ones toward the end of the iteration will risk eviction only once.

If you know N in advance, you can build a list of M random, non-repeating numbers between 0 and N. Iterate once, keeping a count of where you are in the iteration. If the iteration number is on your "eviction list", evict that element.

Following that approach, you must only allocate memory for M index positions (probably integers) temporarily, for the iteration process.

Upvotes: 3

Related Questions