Jade Tang
Jade Tang

Reputation: 319

Random collect item from an stream

Suppose I have a input stream which I don't know how many items in it. As I collecting items from the stream, I have randomly store some items. Suppose I have to store 1,000 items and the items in stream are much more than 1,000 Is there a good algorithm that I can randomly collect items from the stream, and the items are distributed among the length of the stream as evenly as possible?

Upvotes: 0

Views: 48

Answers (2)

MBo
MBo

Reputation: 80287

If you store items chosen, then you can randomly select k items from stream with even distribution using Reservoir sampling algorithm

for first k elements of stream:
  store element in A array

for every next (ith) element:
    generate random indx in range [0, i)
    if indx < k
       replace A[indx] with current element

Upvotes: 1

Kevin
Kevin

Reputation: 76244

No, you can't get even distribution unless you know ahead of time how many items are in the stream.

Suppose the stream has 10,000 items. To get 1,000 evenly distributed items, you should be collecting every tenth item.

Suppose the stream has 100,000 items. To get 1,000 evenly distributed items, you should be collecting every hundredth item.

But you can't distinguish between these two cases until you reach the end of the stream, at which point it's too late to change your collection frequency. If you start out collecting every tenth item, then you might end up with 1,000 collected items while you still have 90,000 items left to go in the stream. If you start out collecting every hundredth item, you might reach the end of the stream while you're 900 items short of your quota.

Upvotes: 0

Related Questions