twk

Reputation: 17320

How to keep a random subset of a stream of data?

I have a stream of events flowing through my servers. It is not feasible for me to store all of them, but I would like to periodically be able to process some of them in aggregate. So, I want to keep a subset of the stream that is a random sampling of everything I've seen, but is capped at a max size.

So, for each new item, I need an algorithm to decide if I should add it to the stored set, or if I should discard it. If I add it, and I'm already at my limit, I need an algorithm to evict one of the old items.

Obviously, this is easy as long as I'm below my limit (just save everything). But how can I maintain a good random sampling without being biased towards old items or new items once I'm past that limit?

Thanks,

Upvotes: 15

Views: 8348

Answers (6)

Chris Dunder

Reputation: 309

This is called reservoir sampling. Source: http://en.wikipedia.org/wiki/Reservoir_sampling

array R[k];    // result
integer i, j;

// fill the reservoir array
for each i in 1 to k do
    R[i] := S[i]
done;

// replace elements with gradually decreasing probability
for each i in k+1 to length(S) do
    j := random(1, i);   // important: inclusive range
    if j <= k then
        R[j] := S[i]
    fi
done

A decent explanation/proof: http://propersubset.com/2010/04/choosing-random-elements.html
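
For reference, here is a minimal runnable translation of the pseudocode above in Python (the function name is just for illustration; random.randint is inclusive on both ends, matching random(1, i)):

import random

def reservoir_sample(stream, k):   # illustrative name
    # Algorithm R: returns k items drawn uniformly at random from stream.
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace an existing item with probability k/i.
            j = random.randint(1, i)  # inclusive range, as in the pseudocode
            if j <= k:
                reservoir[j - 1] = item
    return reservoir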

Upvotes: 6

jonderry

Reputation: 23633

This is a common interview question.

One easy way to do it is to save the nth element with probability k/n (or 1, whichever is lesser). If you need to remove an element to save the new sample, evict a random element.

This gives you a uniformly random subset of the n elements. If you don't know n, you can estimate it and get an approximately uniform subset.
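
A sketch of this approach in Python (the class and method names are just for illustration; here n is tracked as the running count of items seen):

import random

class StreamSample:   # illustrative name
    # Keeps at most k items; every item seen has an equal chance of being retained.
    def __init__(self, k):
        self.k = k
        self.n = 0        # number of items seen so far
        self.sample = []

    def offer(self, item):
        self.n += 1
        # Save the nth item with probability k/n (always, while n <= k).
        if random.random() < self.k / self.n:
            if len(self.sample) >= self.k:
                # At capacity: evict a uniformly random stored item.
                self.sample[random.randrange(self.k)] = item
            else:
                self.sample.append(item)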

Upvotes: 6

Enrique

Reputation: 10117

This assumes you don't know the total number of events that will be received and that you don't need a minimum number of elements in the subset.

import random

MAX_SIZE = 1000            # cap on the stored sample (illustrative value)
arr = [None] * MAX_SIZE    # stores the events; slots may stay empty early on
counter = 1                # 1-based count of events seen so far

for event in event_stream:   # event_stream: hypothetical source of events
    # Accept this event with probability 1/counter.
    if random.randint(1, counter) == counter:
        if counter <= MAX_SIZE:
            arr[counter - 1] = event                  # store in its own slot
        else:
            arr[random.randrange(MAX_SIZE)] = event   # overwrite a random slot
    counter += 1

Upvotes: 1

Joshua Smith

Reputation: 6621

Assign a probability of recording each event and store the events in an indexable data structure. When the structure reaches the size threshold, remove a random element before adding a new one. In Ruby, you could do this:

@storage = []
prob = 0.002

while (message = getnextMessage) do   # getnextMessage: hypothetical message source
  if rand < prob                      # record this message with fixed probability
    # At capacity, evict a uniformly random stored message to make room.
    @storage.delete_at(rand(@storage.length)) if @storage.length >= MAX_LEN
    @storage << message
  end
end

This addresses both your max size and your requirement of non-bias toward when an event occurred. You could also choose which element gets deleted by partitioning the stored elements into buckets and removing an element from any bucket that holds more than one. The bucket method lets you keep, for example, one element from each hour.
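
One way to read the bucket idea, sketched in Python (bucket_key, say the hour an event occurred in, and the function name are assumptions, not part of the answer):

import random
from collections import defaultdict

def evict_from_crowded_bucket(storage, bucket_key):   # illustrative names
    # Group stored events by bucket, then evict a random element
    # from a bucket that holds more than one, preserving coverage.
    buckets = defaultdict(list)
    for idx, event in enumerate(storage):
        buckets[bucket_key(event)].append(idx)
    crowded = [idxs for idxs in buckets.values() if len(idxs) > 1]
    if crowded:
        victim = random.choice(random.choice(crowded))
        storage.pop(victim)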

You should also know that sampling theory is Big Math. If you need more than a layman's idea about this you should consult a qualified mathematician in your area.

Upvotes: 0

Randy

Reputation: 16677

Store samples in a first-in, first-out (FIFO) queue.

Set a sampling rate of so many events between samples, or randomize it a bit, depending on your patterns of events.

Save every nth event, or whenever your rate tells you to, and push it onto the end of the queue.

Pop one off the front if the size is too big.
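
A minimal sketch of this scheme in Python (event_stream, N, and MAX_SIZE are illustrative; a deque with maxlen drops its oldest entry automatically, which gives the FIFO behavior described):

from collections import deque

MAX_SIZE = 1000   # cap on stored samples
N = 100           # keep every Nth event

samples = deque(maxlen=MAX_SIZE)           # oldest sample is dropped when full
for i, event in enumerate(event_stream):   # event_stream: hypothetical source
    if i % N == 0:   # save every Nth event; randomize the interval if desired
        samples.append(event)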

Upvotes: 1

NPE

Reputation: 500437

While this paper isn't precisely what you're looking for, it may be a good starting point in your search.

Upvotes: 1
