curious4cs
curious4cs

Reputation: 193

how to approximate 90th percentile given a stream of millions of numbers

I need to calculate the 90th percentile of a stream of numbers that I am getting every second. It could be up to millions of numbers per second, but the 90th percentile just had to be approximated and not necessarily exact. Is a priority queue/max heap the best way to do this, or something else? If so, how would I end up approximating the value?

Upvotes: 2

Views: 3967

Answers (1)

Jim Mischel
Jim Mischel

Reputation: 134005

The method you select will depend on the nature of your data. If you know, before you start receiving the stream of items, how many items you will receive, you can use a heap-based selection algorithm. For example, if you know that you're going to receive 1,000,000 items and you need to know the 90% percentile, then you know that the 100,000th item marks the 90th percentile. To find it, do the following:

create an empty min heap
add the first 100,000 items to the heap
for each remaining item
    if the item is larger than the smallest item on the heap
        remove the smallest item from the heap
        add the new item to the heap

When you're done, the heap contains the 100,000 largest items, and the root of the heap is the smallest of those. That's your 90th percentile value.

A faster way that uses more memory is to save all of the incoming items in a list, and run Quickselect to find the 100,000th largest item.

Both of the above will give you an exact answer.

If you know that your numbers will be within some relatively small range, you can create buckets to store them in. For example, you said that your numbers are in the range 0 through 150. So you need 151 buckets. Your values are not whole numbers, but since you said an approximation is fine, you can round the values before putting them into buckets. So something like this should work:

buckets = array of 151 values
for each value
    int_value = round(value)
    buckets[int_value] = buckets[int_value] + 1

Now that you have a count of each value figuring out the 90th percentile is a simple matter of counting values from the end of the array (the highest values) until you reach 10%. Something like:

target = 100000  // we want the top 10 percent
bucket = 150
total = 0
while (bucket >= 0)
    total += buckets[bucket]
    if (total >= target)
        break
    bucket = bucket - 1

At this point, the value of bucket is your approximate 90 percentile value.

This method will be faster than the other two, and will use considerably less memory. But it's an approximation rather than an exact value.

Upvotes: 8

Related Questions