Adrien

Reputation: 187

Terra R - Speed up aggregate() of raster data with custom function

I would like to use the aggregate() function from the terra R package to aggregate a raster with a quantile-based aggregation function. Below, I use the quantile() function from base R to compute the 50th percentile (i.e. the median) on an example raster that ships with the package. I chose the 50th percentile so I can compare against median(), but my actual goal is to compute other quantiles...

library(terra)
# load the elevation raster that ships with the terra package
r <- rast(system.file("ex/elev.tif", package = "terra"))

plot(r)

# number of iterations
n_it <- 20

# with a custom function
start_time <- Sys.time()
for (i in 1:n_it){
  ra <- aggregate(r, 2, fun = function(x) quantile(x, probs = .5, na.rm = TRUE))
}
end_time <- Sys.time()

It took my computer approx. 6 secs to do it 20 times.

print(end_time-start_time)

Time difference of 6.052727 secs

When I ran the same aggregation with the built-in median function, it took about 40 times less time to perform the very same 20 iterations!

# with a built-in function
start_time <- Sys.time()
for (i in 1:n_it){
  ra <- aggregate(r, 2, fun = median)
}
end_time <- Sys.time()
print(end_time-start_time)

Time difference of 0.1456101 secs

As I would like to compute percentiles other than the 50th, could someone provide some advice on speeding up aggregate() when using custom functions?

Upvotes: 6

Views: 1562

Answers (2)

Adrien

Reputation: 187

Based on the replies, I tested two options: the tdigest package and the built-in parallelization routine of the terra package (the cores parameter). The t-Digest construction algorithm by Dunning et al. (2019) uses a variant of 1-dimensional k-means clustering to produce a very compact data structure that allows accurate estimation of quantiles. I recommend using the tquantile() function, which cut the processing time to about a third with the tested dataset.
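For context, here is a minimal standalone illustration of the tdigest API (my own sketch, not part of the benchmarks below, assuming the CRAN tdigest package): tdigest() builds the compact sketch and tquantile() queries it, and the estimates come out very close to the exact quantiles.

library(tdigest)

set.seed(1)
x <- rnorm(1e5)

# exact quantiles vs. the t-Digest approximation
quantile(x, probs = c(.1, .5, .9))
tquantile(tdigest(x), probs = c(.1, .5, .9))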

For those who were thinking about foreach parallelization, there is no simple way to run a foreach loop over terra objects. For such tasks, I'm still using the good old raster package. Support is planned but not in the short term - see here. More details below.
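To make that raster fallback concrete, here is a rough sketch (assuming the raster, foreach and doParallel packages; it is not part of the timings below):

library(raster)
library(foreach)
library(doParallel)

# unlike terra SpatRaster objects, raster objects can be exported to foreach workers
r_raster <- raster(system.file("ex/elev.tif", package = "terra"))

cl <- makeCluster(2)
registerDoParallel(cl)

probs <- c(.1, .5, .9)
ra_list <- foreach(p = probs, .packages = "raster") %dopar% {
  aggregate(r_raster, fact = 2,
            fun = function(x, ...) quantile(x, probs = p, na.rm = TRUE))
}

stopCluster(cl)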

Toy dataset

library(terra)
# load the elevation raster that ships with the terra package
r <- rast(system.file("ex/elev.tif", package = "terra"))

plot(r)

# number of iterations
n_it <- 20

With stats::quantile()

start_time <- Sys.time()
for (i in 1:n_it){
  ra <- aggregate(r, 2, fun = function(x) quantile(x, probs = .5, na.rm = TRUE))
}
end_time <- Sys.time()
print(end_time-start_time)

Time difference of 6.013551 secs

With tdigest::tquantile()

library(tdigest)

start_time_tdigest <- Sys.time()
for (i in 1:n_it){
  ra_tdigest <- aggregate(r, 2, fun = function(x) tquantile(tdigest(na.omit(x)), probs = .5))
}
end_time_tdigest <- Sys.time()
print(end_time_tdigest-start_time_tdigest)

Time difference of 1.922526 secs

As suspected by Martin, using the cores parameter of the terra::aggregate() function did not improve the processing time:

stats::quantile() + parallelization

start_time_parallel <- Sys.time()
for (i in 1:n_it){
  ra_quantile_parallel <- aggregate(r, 2, fun = function(x) quantile(x, probs = .5, na.rm = TRUE), cores = 2)
}
end_time_parallel <- Sys.time()
print(end_time_parallel - start_time_parallel)

Time difference of 8.537751 secs

tdigest::tquantile() + parallelization

tdigest_quantil_terra <- function(x) {
  require(tdigest)
  tquantile(tdigest(na.omit(x)), probs = .5)
}

start_time_tdigest_parallel <- Sys.time()
for (i in 1:n_it){
  ra_tdigest_parallel <- aggregate(r, 2,
                                   fun = function(x, ff) ff(x), cores = 2,
                                   ff = tdigest_quantil_terra)
}
end_time_tdigest_parallel <- Sys.time()
print(end_time_tdigest_parallel - start_time_tdigest_parallel)

Time difference of 7.520231 secs

In a nutshell:

1 tdigest                1.922526 secs
2 base_quantile          6.013551 secs
3 tdigest_parallel       7.520231 secs
4 base_quantile_parallel 8.537751 secs

Upvotes: 6

Martin C. Arnold

Reputation: 9658

It is not that aggregate() is slow per se when using custom functions. Rather, it is more expensive to use quantile() instead of median() to obtain the median. This is presumably due to the cost of the computation itself (terra uses a C++ implementation to compute the median that is faster than for an arbitrary quantile) and also because quantile() performs more checks and thus calls more additional functions in the process. This higher computational cost adds up when the operation is performed many times, as aggregate() does.
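One way to see this overhead, independent of aggregate(), is to time the two calls on a single small block of values. A rough sketch, assuming the microbenchmark package (an extra dependency not used elsewhere in this thread):

library(microbenchmark)

x <- rnorm(4)  # the values of one 2x2 aggregation block

# quantile() does more input checking and interpolation setup than median(),
# and this per-call cost is multiplied by the number of output cells
microbenchmark(
  median(x),
  quantile(x, probs = .5, na.rm = TRUE),
  times = 1000
)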

If you have a much larger raster it could be beneficial to distribute computations across multiple cores using the cores argument, see ?terra::aggregate. However, I think this is not an option for the elev data as the overhead is too large.

If you want to call aggregate() for many different probs, you could parallelize that loop instead, e.g. using the foreach package; a rough sketch follows.
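For example, here is a sketch of that idea, assuming the foreach/doParallel packages and a terra version that provides wrap()/unwrap(). Since SpatRaster objects cannot be exported to workers directly, each worker rebuilds the raster from the file path and wraps its result before returning it.

library(terra)
library(foreach)
library(doParallel)

f <- system.file("ex/elev.tif", package = "terra")
probs <- c(.1, .25, .5, .75, .9)

cl <- makeCluster(2)
registerDoParallel(cl)

packed <- foreach(p = probs, .packages = "terra") %dopar% {
  r <- rast(f)  # rebuild the SpatRaster inside each worker
  agg <- aggregate(r, 2, fun = function(x) quantile(x, probs = p, na.rm = TRUE))
  wrap(agg)     # pack the SpatRaster so it can be sent back to the main session
}

stopCluster(cl)

# unpack to SpatRaster objects again, one per value of probs
ra_by_prob <- lapply(packed, unwrap)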

Upvotes: 4
