Cristiano

Reputation: 151

Calculating quantiles without storing

I wrote C++ code to calculate 119 quantiles (from 10^-7 to 1 - 10^-7) of 100 million double-precision numbers. My current implementation stores the numbers in a vector and then sorts the vector. Is there any way to calculate the quantiles without storing the numbers?

Thank you

ADDENDUM (sorry for my English): Here is what I'm doing:

1) Generate 20 uniformly distributed random numbers in [0, 1).

2) Feed those numbers into an algorithm that outputs a random number with unknown mean and unknown variance.

3) Store the number from step 2.

Repeat steps 1, 2 and 3 100 million times (at that point I have collected 10^8 random numbers with unknown mean and unknown variance).

Now I sort those numbers to calculate 119 quantiles from 10^-7 to 1 - 10^-7 using the formula "R-2, SAS-5": https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
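For reference, here is a minimal sketch of the R-2 / SAS-5 rule as described on the linked Wikipedia page: the quantile is the average of the order statistics at 1-based ranks ceil(Np) and floor(Np + 1), clamped to [1, N]. The function name `quantile_r2` is hypothetical and is not taken from the actual program.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sample quantile, type R-2 / SAS-5 (inverse empirical CDF with averaging
// at discontinuities). `sorted` must be non-empty and in ascending order,
// and p must lie in [0, 1]. The name quantile_r2 is illustrative only.
double quantile_r2(const std::vector<double>& sorted, double p) {
    assert(!sorted.empty() && p >= 0.0 && p <= 1.0);
    const double n = static_cast<double>(sorted.size());
    const double np = n * p;
    // 1-based ranks ceil(Np) and floor(Np + 1), clamped to [1, N].
    const double lo = std::max(1.0, std::min(n, std::ceil(np)));
    const double hi = std::max(1.0, std::min(n, std::floor(np + 1.0)));
    // Average the two order statistics (they coincide unless Np is an integer).
    return 0.5 * (sorted[static_cast<std::size_t>(lo) - 1] +
                  sorted[static_cast<std::size_t>(hi) - 1]);
}
```

Because N * p is computed in floating point, a product that should be an exact integer can come out slightly above or below it, which changes which ranks are averaged; this sketch does not guard against that.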

Since the program is multi-threaded, the memory allocation is too big and I can only use 5 threads instead of 8.

Upvotes: 5

Views: 1879

Answers (2)

Koebmand STO

Reputation: 171

You need to know the set of numbers before you can calculate the quantiles.

You can do this either by storing the numbers or by making/using a multi-pass algorithm that learns a little more about the distribution on each pass.
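One possible multi-pass scheme, as a rough sketch only: pass 1 finds the range and the count, pass 2 builds a coarse histogram, and pass 3 keeps just the values that fall in the bin containing the wanted rank. It assumes the stream can be replayed identically (for example by re-seeding the random generator) and that all values are finite. The names `Stream` and `kth_smallest_multipass` and the default of 1024 bins are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Assumption: calling a Stream feeds the SAME sequence of values to the
// given sink every time (e.g. the generator is re-seeded identically).
using Stream = std::function<void(const std::function<void(double)>&)>;

// Returns the k-th smallest value (1-based) in three passes, storing only
// the histogram and the contents of a single bin.
double kth_smallest_multipass(const Stream& replay, std::uint64_t k,
                              std::size_t bins = 1024) {
    // Pass 1: minimum, maximum and number of values.
    double lo = 0.0, hi = 0.0;
    std::uint64_t n = 0;
    replay([&](double x) {
        if (n == 0) { lo = hi = x; }
        else { lo = std::min(lo, x); hi = std::max(hi, x); }
        ++n;
    });
    assert(bins > 0 && k >= 1 && k <= n);
    if (lo == hi) return lo;  // all values are identical

    // Maps a value to its bin; non-decreasing in x, so bins preserve order.
    auto bin_of = [&](double x) {
        const std::size_t b =
            static_cast<std::size_t>((x - lo) / (hi - lo) * bins);
        return std::min(b, bins - 1);
    };

    // Pass 2: count how many values fall in each bin.
    std::vector<std::uint64_t> counts(bins, 0);
    replay([&](double x) { ++counts[bin_of(x)]; });

    // Find the bin that contains rank k, and how many values lie below it.
    std::size_t target = 0;
    std::uint64_t below = 0;
    while (below + counts[target] < k) {
        below += counts[target];
        ++target;
    }

    // Pass 3: store only the values of the target bin and select among them.
    std::vector<double> kept;
    kept.reserve(static_cast<std::size_t>(counts[target]));
    replay([&](double x) {
        if (bin_of(x) == target) kept.push_back(x);
    });
    auto it = kept.begin() + static_cast<std::ptrdiff_t>(k - below - 1);
    std::nth_element(kept.begin(), it, kept.end());
    return *it;
}
```

The memory needed depends on how many values land in the target bin, so a heavily concentrated distribution may need more bins or an extra refinement pass. Each wanted rank also costs three passes here unless several ranks are handled together.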

There are also approximate one-pass algorithms for this problem, if some inaccuracy on the quantiles is acceptable. Here is an example: http://www.cs.umd.edu/~samir/498/manku.pdf

EDIT: I forgot to mention that if your numbers contain many duplicates, you only need to store each distinct value and how many times it appears, not every duplicate. Depending on the input data, this can make a significant difference.
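A minimal sketch of the value-plus-count idea, assuming the values really do repeat exactly (for continuous doubles they often will not, in which case this saves nothing). `Histogram`, `add_value` and `value_at_rank` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Each distinct value mapped to the number of times it was seen.
// std::map keeps the keys in ascending order.
using Histogram = std::map<double, std::uint64_t>;

// Records one observation.
void add_value(Histogram& h, double x) { ++h[x]; }

// Returns the k-th smallest observation (1-based) by walking the
// cumulative counts, without ever expanding the duplicates.
double value_at_rank(const Histogram& h, std::uint64_t k) {
    assert(k >= 1);
    std::uint64_t seen = 0;
    for (const auto& entry : h) {
        seen += entry.second;
        if (seen >= k) return entry.first;
    }
    assert(false && "rank exceeds the number of recorded values");
    return 0.0;
}
```

Any quantile rule that is defined through order statistics can then be built on top of `value_at_rank`.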

Upvotes: 2

Ami Tavory

Reputation: 76316

This is a problem from the field of streaming algorithms (where you need to operate on a stream of data without storing each element).

There are well-known streaming algorithms for computing quantiles (e.g., here), but if you are willing to accept approximate quantiles, it's a fairly easy problem. Simply use reservoir sampling to uniformly sample m out of n elements, and calculate the quantiles on the sample (using the method you already have: storing the m samples in a vector and sorting it). The size m determines the approximation's precision (see, e.g., here).
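A minimal sketch of reservoir sampling (the classic "Algorithm R" variant), assuming the values arrive one at a time and n is not known in advance. The class name `Reservoir` and its interface are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Keeps a uniform random sample of at most `capacity` values from a
// stream of unknown length (reservoir sampling, Algorithm R).
class Reservoir {
public:
    Reservoir(std::size_t capacity, std::uint64_t seed)
        : capacity_(capacity), rng_(seed) {
        assert(capacity > 0);
        sample_.reserve(capacity);
    }

    // Offers one value from the stream to the reservoir.
    void add(double x) {
        ++seen_;
        if (sample_.size() < capacity_) {
            sample_.push_back(x);  // the first m values are always kept
            return;
        }
        // Draw j uniformly from [0, seen_ - 1]; the new value replaces
        // slot j if j falls inside the reservoir, which happens with
        // probability m / seen_.
        std::uniform_int_distribution<std::uint64_t> pick(0, seen_ - 1);
        const std::uint64_t j = pick(rng_);
        if (j < capacity_) sample_[static_cast<std::size_t>(j)] = x;
    }

    const std::vector<double>& sample() const { return sample_; }
    std::uint64_t seen() const { return seen_; }

private:
    std::size_t capacity_;
    std::mt19937_64 rng_;
    std::vector<double> sample_;
    std::uint64_t seen_ = 0;
};
```

After the stream ends, copy `sample()`, sort the copy, and compute the quantiles on it as before. Keep in mind that a sample of size m cannot resolve quantiles much finer than about 1/m, so tail quantiles as extreme as 10^-7 would need a very large m.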

Upvotes: 4
