Calculate quantiles for large data

Question

I have about 300 files, each containing 1000 time series realisations (~76 MB each file).

I want to calculate the quantiles (0.05, 0.50, 0.95) at each time step from the full set of 300000 realisations.

I cannot merge together the realisations in 1 file because it would become too large.

What's the most efficient way of doing this?

Each matrix is generated by running a model, however here is a sample containing random numbers:

x <- matrix(rexp(10000000, rate=.1), nrow=1000)

Ari B. Friedman · Accepted Answer

There are at least three options:

1. Are you sure it has to be from the full set? A 10% sample should be a very, very good approximation here.
2. 300k elements isn't that big of a vector, but a 300k x 100+ column matrix is big. Pull just the column you need into memory rather than the entire matrix (this can be repeated over every column if necessary).
3. Do it sequentially, possibly in conjunction with a smaller sample to get you started in the right ballpark. For the 5th percentile, you just need to know how many items are above the current guess and how many are below. So something like:
   a. Take a 1% sample and find the 5th percentile of it. Jump some tolerance above and below, such that you're sure the exact 5th percentile lies in that range.
   b. Read in the matrix in chunks. For each chunk, count the number of observations above the range and below the range. Then retain all observations which lie within the range.
   c. When you've read in the last chunk, you have three pieces of information (count above, count below, vector of observations within). One way to take a quantile is to sort the whole vector and find the nth observation, where n is the rank of the desired quantile in the full data. You can do that with the above pieces of information: sort the within-range observations, and find the (n - count_below)th.
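As a rough illustration of the first two options, here is a minimal sketch. It assumes each file holds one matrix saved with saveRDS(), with realisations in rows and time steps in columns; the file names, the sizes and the helper time_step_quantiles() are made up for the example, and small temporary files stand in for the real model output.

```r
set.seed(42)

# Hypothetical stand-in for the model output: 5 small files, each holding a
# matrix with realisations in rows and time steps in columns.
files <- file.path(tempdir(), sprintf("realisations_%02d.rds", 1:5))
for (f in files) saveRDS(matrix(rexp(200 * 20, rate = 0.1), nrow = 200), f)

probs <- c(0.05, 0.50, 0.95)

# Column-at-a-time: exact quantiles for time step j. Only one file plus the
# values collected for that single time step are in memory at once.
time_step_quantiles <- function(files, j, probs) {
  values <- unlist(lapply(files, function(f) readRDS(f)[, j]))
  quantile(values, probs)
}
exact <- time_step_quantiles(files, j = 1, probs = probs)

# Sampling: approximate quantiles for every time step from a 10% random
# sample of the realisations in each file.
sampled <- do.call(rbind, lapply(files, function(f) {
  m <- readRDS(f)
  m[sample(nrow(m), size = ceiling(0.1 * nrow(m))), , drop = FALSE]
}))
approx_q <- apply(sampled, 2, quantile, probs = probs)  # one column per time step

exact
approx_q[, 1]
```

The column-at-a-time version re-reads every file for each time step, so in practice it is probably worth pulling a block of columns per pass rather than a single one.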

Edit: Example of (3).

Note that I am not a champion algorithm designer and that someone has almost certainly designed a better algorithm for this. Also, this implementation is not particularly efficient. If speed matters to you, consider Rcpp, or even just more optimized R for this. Making a bunch of lists and then extracting values from them is not so smart, but it was easy to prototype this way so I went with it.

library(plyr)

set.seed(1)

# -- Configuration -- #
desiredQuantile <- .25

# -- Generate sample data -- #

# Use some algorithm (sampling, iteration, or something else) to come up with a range you're sure the true value lies within
guessedrange <- c( .2, .3 )
# Group the observations to correspond to the OP's files
dat <- data.frame( group = rep( seq(100), each=100 ), value = runif(10000) )

# -- Apply the algorithm -- #

# Count the number above/below and return the values within the range, by group
res <- dlply( dat, .( group ), function( x, guessedrange ) {
  above <- x$value > guessedrange[2]
  below <- x$value < guessedrange[1]
  list(
    aboveCount  = sum( above ),
    belowCount = sum( below ),
    withinValues = x$value[ !above & !below ]
  )
}, guessedrange = guessedrange )
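# Extract the count of values below and the values within the range
belowCount <- sum( sapply( res, function(x) x$belowCount ) )
belowCount
# lapply + unlist always yields a plain vector (sapply could return a matrix if every group kept the same number of values)
withinValues <- unlist( lapply( res, function(x) x$withinValues ), use.names = FALSE )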
str(withinValues)
# Count up until we find the within value we want
desiredQuantileCount <- floor( desiredQuantile * nrow(dat) ) #! Should fix this so it averages when there's a tie
sort(withinValues)[ desiredQuantileCount - belowCount + 1 ]
# Compare to exact value
quantile( dat$value, desiredQuantile )

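In the end, the value is a little off from the exact version. That is expected: quantile() defaults to type 7, which interpolates linearly between the order statistics on either side of position (n - 1) * p + 1 (2500.75 here), whereas the code above just picks the 2501st smallest value.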
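For what it's worth, here is a sketch of how the same counts could be used to reproduce quantile()'s default (type 7) result. It uses the same simulated data and guessed range as above, replaces plyr with base R, and the variable names are only illustrative. It also assumes the guessed range really does bracket the quantile, which the stopifnot() checks.

```r
set.seed(1)

p <- 0.25
guessed_range <- c(0.2, 0.3)  # assumed to bracket the true quantile
dat <- data.frame(group = rep(seq(100), each = 100), value = runif(10000))

# One pass per group: count values below the range, keep values inside it.
res <- lapply(split(dat$value, dat$group), function(v) {
  list(below  = sum(v < guessed_range[1]),
       within = v[v >= guessed_range[1] & v <= guessed_range[2]])
})
below_count   <- sum(vapply(res, function(x) x$below, integer(1)))
within_sorted <- sort(unlist(lapply(res, function(x) x$within), use.names = FALSE))

# Type 7 position among all n sorted values, shifted into the retained window.
n  <- nrow(dat)
h  <- (n - 1) * p + 1
lo <- floor(h) - below_count
hi <- ceiling(h) - below_count
stopifnot(lo >= 1, hi <= length(within_sorted))  # fails if the guess was too narrow

# Linear interpolation between the two neighbouring order statistics.
estimate <- within_sorted[lo] +
  (h - floor(h)) * (within_sorted[hi] - within_sorted[lo])

estimate
quantile(dat$value, p, names = FALSE)
```

If the stopifnot() fails, the guessed range missed the quantile and the pass has to be repeated with a wider range.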
