Andy Jones

Reputation: 4971

Evenly-spaced samples from a stream of unknown length

I want to sample K items from a stream of N items that I see one at a time. I don't know how big N is until the last item turns up, and I want the space consumption to depend on K rather than N.

So far I've described a reservoir sampling problem. The major ask though is that I'd like the samples to be 'evenly spaced', or at least more evenly spaced than reservoir sampling manages. This is vague; one formalization would be that the sample indices are a low-discrepancy sequence, but I'm not particularly tied to that.

I'd also like the process to be random and every possible sample to have a non-zero probability of appearing, but I'm not particularly tied to this either.

My intuition is that this is a feasible problem, and the algorithm I imagine preferentially drops samples from the 'highest-density' part of the reservoir in order to make space for samples from the incoming stream. It also seems like a common enough problem that someone should have written a paper on it, but Googling combinations of 'evenly spaced', 'reservoir', 'quasirandom', and 'sampling' hasn't gotten me anywhere.

edit #1: An example might help.

Example

Suppose K=3, and I get items 0, 1, 2, 3, 4, 5, ....

After 3 items, the sample would be [0, 1, 2], with spaces of {1}

After 6 items, I'd like to most frequently get [0, 2, 4] with its spaces of {2}, but commonly getting samples like [0, 3, 5] or [0, 2, 5] with spaces of {2, 3} would be good too.

After 9 items, I'd like to most frequently get [0, 4, 8] with its spaces of {4}, but commonly getting samples like [0, 4, 7] with spaces of {4, 3} would be good too.

edit #2: I've learnt a lesson here about providing lots of context when requesting answers. David and Matt's answers are promising, but in case anyone sees this and has a perfect solution, here's some more information:

Context

I have hundreds of low-res videos streaming through a GPU. Each stream is up to 10,000 frames long, and - depending on application - I want to sample 10 to 1000 frames from each. Once a stream is finished and I've got a sample, it's used to train a machine learning algorithm, then thrown away. Another stream is started in its place. The GPU's memory is 10 gigabytes, and a 'good' set of reservoirs occupies a few gigabytes in the current application and plausibly close to the entire memory in future applications.

Upvotes: 1

Views: 440

Answers (3)

Matt Timmermans

Reputation: 59303

Perhaps the simplest way to do reservoir sampling is to associate a random score with each sample, and then use a heap to remember the k samples with the highest scores.

This corresponds to applying a threshold operation to white noise, where the threshold value is chosen to admit the correct number of samples. Every sample has the same chance of being included in the output set, exactly as if k samples were selected uniformly.
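For concreteness, here's a minimal Python sketch of the heap version (my illustration of the technique described above, not code from the answer):

```python
import heapq
import random

def reservoir_top_k(stream, k):
    """Keep the k items with the largest i.i.d. random scores.
    Since every item's score is drawn from the same distribution,
    every size-k subset is equally likely to own the top k scores."""
    heap = []  # min-heap of (score, index, item); heap[0] is the cutoff
    for i, item in enumerate(stream):
        score = random.random()
        if len(heap) < k:
            heapq.heappush(heap, (score, i, item))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, i, item))
    # return the survivors in stream order
    return [(i, item) for _, i, item in sorted(heap, key=lambda t: t[1])]
```

Swapping out the score generator is the only change needed for the blue-noise variant described next.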

If you sample blue noise instead of white noise to produce your scores, however, then applying a threshold operation will produce a low-discrepancy sequence and the samples in your output set will be more evenly spaced. This effect occurs because, while white noise samples are all independent, blue noise samples are temporally anti-correlated.

This technique is used to create pleasing halftone patterns (google Blue Noise Mask). Theoretically it works for any final sampling ratio, but realistically it's limited by numeric precision. I think it has a good chance of working OK for your range of 1-100, but I'd be more comfortable with 1-20.

There are many ways to generate blue noise, but probably your best choices are to apply a high-pass filter to white noise or to construct an approximation directly from 1D Perlin noise.
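As a crude streaming illustration of the first option (my sketch, not the answer's; a first difference of white noise is the simplest high-pass filter, strictly speaking producing violet noise, an extreme blue):

```python
import heapq
import random

def blue_noise_reservoir(stream, k):
    """The same heap trick, but scored with high-pass-filtered noise:
    score[i] = w[i] - w[i-1] for white noise w. Adjacent scores are
    anti-correlated, so neighbouring items rarely both survive."""
    heap = []
    prev = random.gauss(0.0, 1.0)
    for i, item in enumerate(stream):
        w = random.gauss(0.0, 1.0)
        score = w - prev  # first difference: suppresses low frequencies
        prev = w
        if len(heap) < k:
            heapq.heappush(heap, (score, i, item))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, i, item))
    return sorted(i for _, i, _ in heap)  # sampled indices in order
```

Unlike the white-noise version, the selection here is no longer uniform over subsets; that bias toward even spacing is the point.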

Upvotes: 1

David Eisenstat

Reputation: 65498

Here's an algorithm that doesn't require much extra memory. Hopefully it meets your quality requirements.

The high-level idea is to divide the input into k segments and choose one element uniformly at random from each segment. Given the memory constraint, we can't make the segments as even as we would like, but they'll be within a factor of two.

The simple version of this algorithm (which uses 2k reservoir slots and may return a sample of any size between k and 2k) starts by reading the first k elements, then proceeds in rounds. In round r (counting from zero), we read k·2^r elements, using the standard reservoir algorithm to choose one random sample from each segment of 2^r elements. At the end of each round, we append these k samples to the existing reservoir and do the following compression step: for each pair of consecutive elements, choose one uniformly at random to retain and discard the other. (Since either element of a pair survives with probability 1/2, the survivor is a uniform sample of the merged segment, now of size 2^(r+1).)
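A minimal Python sketch of this simple version (my code, not the answer's; it returns between k and 2k items depending on where the stream ends):

```python
import random

def simple_segmented_sample(stream, k):
    """Round r reads k * 2**r items in k segments of length 2**r,
    reservoir-sampling one item per segment; compression then keeps
    one element of each consecutive pair, halving 2k slots back to k."""
    it = iter(stream)
    reservoir = []
    for _ in range(k):  # seed with the first k elements
        try:
            reservoir.append(next(it))
        except StopIteration:
            return reservoir
    seg = 1  # segment length, 2**r in round r
    while True:
        fresh = []
        for _ in range(k):  # k segments per round
            sample, n = None, 0
            for _ in range(seg):
                try:
                    x = next(it)
                except StopIteration:
                    if sample is not None:
                        fresh.append(sample)
                    return reservoir + fresh  # between k and 2k items
                n += 1
                if random.randrange(n) == 0:  # standard reservoir step
                    sample = x
            fresh.append(sample)
        reservoir += fresh  # all 2k slots full, in index order
        # compression: retain one of each consecutive pair at random
        reservoir = [pair[random.randrange(2)]
                     for pair in zip(reservoir[0::2], reservoir[1::2])]
        seg *= 2
```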

The complicated version of this algorithm uses k slots and returns a sample of size k by interleaving the round sampling step with compression. Rather than write a formal description, I'll demonstrate it, since I think that will be easier to understand.

Let k = 8. We pick up after 32 elements have been read. I use the notation [a-b] to mean a random element whose index is between a and b inclusive. The reservoir looks like this:

[0-3] [4-7] [8-11] [12-15] [16-19] [20-23] [24-27] [28-31]

Before we process the next element (32), we have to make room. This means merging [0-3] and [4-7] into [0-7].

[0-7] [32] [8-11] [12-15] [16-19] [20-23] [24-27] [28-31]

We merge the next few elements into [32].

[0-7] [32-39] [8-11] [12-15] [16-19] [20-23] [24-27] [28-31]

Element 40 requires another merge, this time of [16-19] and [20-23]. In general, we do merges in a low-discrepancy order.

[0-7] [32-39] [8-11] [12-15] [16-23] [40] [24-27] [28-31]

Keep going.

[0-7] [32-39] [8-11] [12-15] [16-23] [40-47] [24-27] [28-31]

At the end of the round, the reservoir looks like this.

[0-7] [32-39] [8-15] [48-55] [16-23] [40-47] [24-31] [56-63]

We use the standard bit-reversal technique from FFTs to undo the butterfly permutation of the new samples and move them to the end.

[0-7] [8-15] [16-23] [24-31] [32-39] [40-47] [48-55] [56-63]

Then we start the next round.
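For that unscrambling step, here's a small sketch under the layout assumptions of the example above (even slots hold the old segments in index order; odd slots hold the new segments in bit-reversed order; the function name is mine):

```python
def unscramble(reservoir):
    """Move the round's new samples to the end, undoing their
    bit-reversed placement. Assumes len(reservoir) is a power of two."""
    half = len(reservoir) // 2
    old = reservoir[0::2]   # old segments, already in index order
    new = reservoir[1::2]   # new segments, in bit-reversed order
    bits = (half - 1).bit_length()
    def bitrev(i):
        return int(format(i, f'0{bits}b')[::-1], 2)
    new_sorted = [None] * half
    for i, x in enumerate(new):
        new_sorted[bitrev(i)] = x
    return old + new_sorted
```

On the example reservoir this maps [0-7] [32-39] [8-15] [48-55] [16-23] [40-47] [24-31] [56-63] to the sorted layout shown above.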

Upvotes: 1

David Eisenstat

Reputation: 65498

If space isn't at a premium, I'd oversample using the uniform random reservoir algorithm by some constant factor (e.g., if you need k items, sample 10k) and remember the index that each sampled item appeared at. At the end, use dynamic programming to choose k indexes to maximize (e.g.) the sum of the logs of the gaps between consecutive chosen indexes.
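A sketch of that two-phase scheme (my function names; uniform reservoir oversampling, then an O(k·m²) dynamic program over the m retained indices to maximize the sum of log gaps):

```python
import math
import random

def oversample(stream, m):
    """Standard uniform reservoir sample of m (index, item) pairs."""
    reservoir = []
    for i, item in enumerate(stream):
        if len(reservoir) < m:
            reservoir.append((i, item))
        else:
            j = random.randrange(i + 1)
            if j < m:
                reservoir[j] = (i, item)
    reservoir.sort()  # back into stream order
    return reservoir

def pick_evenly_spaced(samples, k):
    """Choose k of the sorted samples maximizing the sum of the logs
    of the gaps between consecutive chosen indexes."""
    m = len(samples)
    if m <= k:
        return samples
    NEG = float('-inf')
    # best[j][i]: best score with j items chosen, the last at position i
    best = [[NEG] * m for _ in range(k + 1)]
    back = [[-1] * m for _ in range(k + 1)]
    best[1] = [0.0] * m
    for j in range(2, k + 1):
        for i in range(j - 1, m):
            for p in range(j - 2, i):
                if best[j - 1][p] == NEG:
                    continue
                gap = samples[i][0] - samples[p][0]
                score = best[j - 1][p] + math.log(gap)
                if score > best[j][i]:
                    best[j][i] = score
                    back[j][i] = p
    i = max(range(m), key=lambda i: best[k][i])
    chosen, j = [], k
    while i >= 0:  # walk the back-pointers
        chosen.append(samples[i])
        i, j = back[j][i], j - 1
    return chosen[::-1]

# e.g. pick_evenly_spaced(oversample(stream, 10 * k), k)
```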

Upvotes: 1
