Reputation: 871
I am trying to compute quantiles (they can be approximate, with some accuracy guarantees or error bounds) for a huge dataset (terabytes of data). How can I compute them efficiently? The requirements are:
1) Can be computed efficiently (one-pass) or in a distributed way (merging)
2) High accuracy (or at least controllable accuracy)
3) Can be re-computed or reproduced in multiple languages (Java and Python)
4) Incrementally updatable (not a requirement, but good to have)
The approaches I am looking at are:
1) The naive solution: reservoir sampling (I'm not sure how to do it in a distributed map-reduce way, especially how to merge different reservoir samples for the same data or for two different distributions; are there any good implementations? A sketch of one mergeable variant I have in mind is below the list.)
2) t-digest
3) Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. (The reason being that, AFAIK, some map-reduce frameworks like Dataflow and BigQuery already implement a variation of this.)
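For reference, here is a minimal sketch of the kind of mergeable sample I have in mind for 1): a "bottom-k" sample that keeps the k items with the smallest uniform random keys, so merging two samples is just keeping the k smallest keys from their union. The class and method names are only illustrative, not from any library.

```python
import heapq
import random

class MergeableSample:
    """Uniform random sample of size k that merges across workers (bottom-k sampling)."""

    def __init__(self, k):
        self.k = k
        # Min-heap of (-key, value): heap[0] holds the item with the largest key kept so far.
        self.heap = []

    def add(self, value):
        key = random.random()
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (-key, value))
        elif key < -self.heap[0][0]:
            # New key is smaller than the largest kept key, so it displaces it.
            heapq.heapreplace(self.heap, (-key, value))

    def merge(self, other):
        """Keep the k smallest keys from the union of both samples."""
        merged = MergeableSample(self.k)
        merged.heap = heapq.nlargest(self.k, self.heap + other.heap)  # largest -key == smallest key
        heapq.heapify(merged.heap)
        return merged

    def quantile(self, q):
        """Empirical q-quantile (0 <= q <= 1) of the retained sample."""
        values = sorted(v for _, v in self.heap)
        return values[min(int(q * len(values)), len(values) - 1)]
```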
Can someone who has prior experience working with these algorithms and techniques give me some pointers about the caveats, pros, and cons of each? When should I use which method? Is one approach arguably better than the others when the requirements are efficient computation and good accuracy?
I have not used a digest-based approach in particular, and I would like to understand better why and when I would prefer something like t-digest over something simple like reservoir sampling to compute approximate quantiles.
Upvotes: 1
Views: 779
Reputation: 17913
UPDATE: a new and very good algorithm, called KLL, has since appeared. See the paper. It has implementations in Python and Go.
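For illustration, here is a minimal sketch of how KLL could be used from the Apache DataSketches Python bindings (assuming the datasketches PyPI package and its kll_floats_sketch class; check the library's documentation for the exact API):

```python
from datasketches import kll_floats_sketch  # pip install datasketches (assumed available)

# Each worker builds its own sketch over its shard of the data.
left = kll_floats_sketch(200)    # the parameter k trades accuracy for sketch size
right = kll_floats_sketch(200)
for x in (1.0, 2.0, 3.0, 4.0):
    left.update(x)
for x in (5.0, 6.0, 7.0, 8.0):
    right.update(x)

# Sketches merge, so per-shard results can be combined in a reduce step.
left.merge(right)
print(left.get_quantile(0.5))    # approximate median of the combined data
```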
t-digest has implementations in several languages and satisfies all of your requirements. See the paper, which compares it to some other algorithms, e.g. Q-Digest. You can find more comparisons in the Q-Digest paper.
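A minimal sketch of the same pattern with t-digest, assuming the tdigest PyPI package (method names may differ across implementations, so check the docs of whichever library you pick):

```python
from tdigest import TDigest  # pip install tdigest (assumed third-party package)

# Build one digest per worker / data shard.
d1, d2 = TDigest(), TDigest()
d1.batch_update([1.0, 2.0, 3.0, 4.0])
d2.batch_update([5.0, 6.0, 7.0, 8.0])

# Digests are mergeable, which is what makes the approach map-reduce friendly.
combined = d1 + d2
print(combined.percentile(50))   # approximate median (percentile takes 0-100)
```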
Generally, both of these algorithms are far superior to sampling-based algorithms for estimating quantiles, giving much better accuracy for the same amount of storage. You can find a discussion of many more approximate algorithms in the excellent book Data Streams: Algorithms and Applications (it does not discuss t-digest, because t-digest was created after the book was published).
There might be other, better algorithms that I'm not familiar with.
There is currently no Beam wrapper for the t-digest library, but it should not be difficult to develop one using a custom CombineFn. See, for example, a currently pending PR adding support for a different approximate algorithm using a CombineFn.
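As a rough illustration of what such a wrapper could look like, here is a sketch of a custom CombineFn that accumulates a t-digest over a PCollection of floats (again assuming the tdigest package; this is not an existing Beam transform):

```python
import apache_beam as beam
from tdigest import TDigest  # assumed third-party package, as above

class TDigestCombineFn(beam.CombineFn):
    """Sketch: combine a PCollection of floats into a few approximate quantiles."""

    def create_accumulator(self):
        return TDigest()

    def add_input(self, accumulator, element):
        accumulator.update(element)
        return accumulator

    def merge_accumulators(self, accumulators):
        merged = TDigest()
        for acc in accumulators:
            merged = merged + acc  # t-digests merge associatively
        return merged

    def extract_output(self, accumulator):
        # Emit a small dict of quantiles; adjust to whatever you need downstream.
        return {p: accumulator.percentile(p) for p in (25, 50, 75, 99)}

# Usage: quantiles = numbers | beam.CombineGlobally(TDigestCombineFn())
```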
Upvotes: 1