efficient algorithm for computing quantiles in terabytes dataset

Question

I am trying to compute quantiles (can be approximate with some accuracy guarantees or error bounds) for a huge dataset (terabytes of data) . How can i efficiently compute quantiles . The requirements are

1) Can be computed efficiently (one-pass) or in a distributed way (merging)
2) High accuracy (or at least can be controlled)
3) Can be re-computed or reproduced in multiple language (java and python)
4) Incrementally updated (not a requirement but good to have)

The few approaches I am looking at are :

1) The naive solution : reservoir sampling (not sure how to do it in
distributed map reduce way specially how to merge different reservoir samples for same data or two different distributions, are there any
good implementations ? )

2) t-digest

3) Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. Approximate medians and other quantiles in one pass and with
limited memory. (Reason being i think some map reduce frameworks like dataflow and BigQuery already implement variation of this AFAIK)

Can someone who has prior experience with working with these algorithm and techniques provide me some pointers about what are the caveats, pros and cons for each . When to use which method, is one approach arguably better than other if requirement is efficient computation and accuracy better.

I have not in particular used digest based approach and would like to understand better why and when would i prefer something like t-digest over something simple like reservoir sampling to compute the approximate quantiles.

efficient algorithm for computing quantiles in terabytes dataset

Answers (1)

Related Questions