Reputation: 187
I would like to use the aggregate function from the terra R package to aggregate a raster with a quantile-based aggregation function. Below, I use the quantile function from base R to compute the 50th percentile (i.e. the median) on a raster shipped with the package. I chose the 50th percentile so I can compare against median, but my actual goal is to compute other quantile(s)...
library(terra)
# load the elevation raster that ships with the terra package
r <- rast( system.file("ex/elev.tif", package="terra") )
plot(r)
# number of iterations
n_it <- 20
# with a custom function
start_time <- Sys.time()
for (i in 1:n_it){
ra <- aggregate(r, 2 , fun = function(x) quantile(x, probs = .5, na.rm = T))
}
end_time <- Sys.time()
print(end_time-start_time)
Time difference of 6.052727 secs
It took my computer approx. 6 seconds to run the 20 iterations.
When I ran the same aggregate call with the built-in median function, the very same 20 iterations took approx. 40 times less time!
# with a built-in function
start_time <- Sys.time()
for (i in 1:n_it){
ra <- aggregate(r, 2 , fun = median)
}
end_time <- Sys.time()
print(end_time-start_time)
Time difference of 0.1456101 secs
As I would like to compute percentiles other than the 50th, could someone provide some advice on how to speed up aggregate when using custom functions?
Upvotes: 6
Views: 1562
Reputation: 187
Based on the replies, I tested two options: the tdigest package and the built-in parallelization of the terra package (the cores parameter). The t-Digest construction algorithm (Dunning et al., 2019) uses a variant of 1-dimensional k-means clustering to produce a very compact data structure that allows accurate estimation of quantiles. I recommend using the tquantile function, which cut the processing time to roughly a third with the tested dataset.
For those who were thinking about foreach parallelization: there is no simple way to run a foreach loop over terra objects. For such tasks, I'm still using the good old raster package (a sketch of that approach is at the end of this answer). A terra update is planned, but not in the short term - see here. More details below.
library(terra)
# load the elevation raster that ships with the terra package
r <- rast( system.file("ex/elev.tif", package="terra") )
plot(r)
# number of iterations
n_it <- 20
# With `stats::quantile()` function
start_time <- Sys.time()
for (i in 1:n_it){
ra <- aggregate(r, 2 , fun = function(x) quantile(x, probs = .5, na.rm = T))
}
end_time <- Sys.time()
print(end_time-start_time)
Time difference of 6.013551 secs
# With `tdigest::tquantile()` function
library(tdigest)
start_time_tdigest <- Sys.time()
for (i in 1:n_it){
ra_tdigest <- aggregate(r, 2 , fun = function(x) tquantile(tdigest(na.omit(x)), probs = .5))
}
end_time_tdigest <- Sys.time()
print(end_time_tdigest-start_time_tdigest)
Time difference of 1.922526 secs
As suspected by Martin, the use of the cores parameter of the terra::aggregate function did not improve the processing time:
# With `stats::quantile()` + parallelization
start_time_parallel <- Sys.time()
for (i in 1:n_it){
ra_tdigest_parallel <- aggregate(r, 2 , fun = function(x) quantile(x, probs = .5, na.rm = T), cores = 2)
}
end_time_parallel <- Sys.time()
print(end_time_parallel-start_time_parallel)
Time difference of 8.537751 secs
# With `tdigest::tquantile()` + parallelization
tdigest_quantil_terra <- function(x) {
  require(tdigest)
  tquantile(tdigest(na.omit(x)), probs = .5)
}
start_time_tdigest_parallel <- Sys.time()
for (i in 1:n_it){
ra_tdigest_parallel <- aggregate(r, 2 ,
fun = function(x, ff) ff(x), cores = 2 ,
ff = tdigest_quantil_terra)
}
end_time_tdigest_parallel <- Sys.time()
print(end_time_tdigest_parallel-start_time_tdigest_parallel)
Time difference of 7.520231 secs
In a nutshell:
  method                 time
1 tdigest                1.922526 secs
2 base_quantile          6.013551 secs
3 tdigest_parallel       7.520231 secs
4 base_quantile_parallel 8.537751 secs
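For completeness, here is a minimal sketch of the raster-based foreach workaround mentioned above. This is my own illustration rather than part of the benchmark: the two-core cluster and the probs vector are arbitrary, and raster::aggregate is generally slower than terra::aggregate, so it only pays off for larger jobs.
library(raster)
library(foreach)
library(doParallel)

# RasterLayer objects serialize to parallel workers, unlike SpatRaster objects
rr <- raster(system.file("ex/elev.tif", package = "terra"))
probs <- c(0.25, 0.5, 0.75)

cl <- makeCluster(2)
registerDoParallel(cl)
# one aggregated layer per requested quantile, combined into a RasterStack
agg_stack <- foreach(p = probs, .packages = "raster", .combine = stack) %dopar% {
  aggregate(rr, fact = 2, fun = function(x, ...) quantile(x, probs = p, na.rm = TRUE))
}
stopCluster(cl)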
Upvotes: 6
Reputation: 9658
It is not that aggregate() is slow per se when using custom functions. Rather, it is more expensive to use quantile() than median() to obtain the median. This is presumably due to the cost of the computation itself (terra uses a C++ implementation to compute the median that is faster than for an arbitrary quantile) and also because quantile() performs more checks and thus calls more additional functions in the process. This higher computational cost adds up when the operation is performed many times, as aggregate does.
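As a rough R-level illustration of that per-call overhead (my own sketch, separate from terra's C++ path; the vector size and repetition count are arbitrary):
x <- runif(1000)
# median() does little beyond a partial sort
system.time(for (i in 1:10000) median(x, na.rm = TRUE))
# quantile() additionally validates probs, builds percent names and interpolates,
# so each call is noticeably more expensive
system.time(for (i in 1:10000) quantile(x, probs = 0.5, na.rm = TRUE))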
If you have a much larger raster, it could be beneficial to distribute computations across multiple cores using the cores argument; see ?terra::aggregate. However, I think this is not an option for the elev data, as the overhead is too large.
If you want to call aggregate for many different probs, you could instead parallelize that loop, e.g. using the foreach package.
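A minimal sketch of that idea, assuming a recent terra version that provides wrap()/unwrap() (SpatRaster objects hold external pointers and cannot be serialized to or from workers directly); the cluster size and the probs vector are placeholders:
library(terra)
library(foreach)
library(doParallel)

f <- system.file("ex/elev.tif", package = "terra")
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)

cl <- makeCluster(2)
registerDoParallel(cl)
packed <- foreach(p = probs, .packages = "terra") %dopar% {
  r <- rast(f)  # rebuild the SpatRaster inside the worker from its file path
  a <- aggregate(r, 2, fun = function(x) quantile(x, probs = p, na.rm = TRUE))
  wrap(a)       # pack the result so it can be sent back to the main session
}
stopCluster(cl)

ra_all <- do.call(c, lapply(packed, unwrap))  # one layer per quantile
names(ra_all) <- paste0("q", probs * 100)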
Upvotes: 4