climsaver
climsaver

Reputation: 617

Calculate trend in array using linear regression

I have an array of the dimensions c(54,71,360) which contains climatalogical data. The first two dimensions describe the grid of the region, while the third one serves as time dimension. So in this case, there are 360 time steps (months).

Here is code to produce a sample array:

set.seed(5)
my_array <- array(sample(rnorm(100), 600, replace=T), dim= c(54,71,360))

Now I would like to calculate the trend of each grid cell. The trend is equal to the slope of the linear regression equation. This is why the calculation of the linear regression of every grid cell with the time needs to be performed. And this is exactly what I am struggeling with.

To clearly show what I wish to do, here is an example with one grid cell, which is taken from the array as a vector of the length 360:

grid_cell <- my_array[1,1,]

The linear regression of this vector with the time needs to be calculated. For that purpose, we create a simple time vector:

time_vec <- 1:360

Since I am only interested at the slope coefficient, it can be done this way:

trend <- lm(grid_cell ~ time_vec)$coefficients[2]

This leads to a value of 1.347029e-05 in this case.

I would like to do this for every grid cell of the array, so that the output is a matrix of the dimensions c(54,71), meaning one trend value for each grid cell.

I tried the following, which did not work:

trend_mat <- apply(my_array, 1:2, lm(my_array ~ time_vec)$coefficients[2])

I receive the error message:Error in model.frame.default: variable lengths differ.

This is kind of surprising, since both, the third dimension of the array and the time_vec are both of the length 360.

Anybody with an idea how to achieve this?

Of course I am also open for other solutions which may work totally differently, as long as they lead to the same result.

Upvotes: 1

Views: 520

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269854

The problems with the code in the question are that

  • the third argument of apply should be a function and the question's code provides an expression instead of a function.
  • it applies lm many times. We show how to do it applying lm only once and in the second alternative we don't use lm at all. this gives one and two order of magnitude speedups as shown in the Performance section below.

It is easier to illustrate if we use smaller data as shown in the Note at the end. To use it on your example just replace dims with the line shown in the commented out line in the Note.

1) First we reshape the array into a matrix, perform lm and then reshape it back. This invokes lm once rather than invoking it prod(dims[1:2]) times.

m <- t(matrix(a,,dim(a)[3]))
array(coef(lm(m ~ timevec))[2, ], dim(a)[1:2])
##            [,1]      [,2]      [,3]
## [1,]  0.2636792 0.5682025 -0.255538
## [2,] -0.4453307 0.2338086  0.254682

# check
coef(lm(a[1,1,] ~ timevec))[[2]]
## [1] 0.2636792
coef(lm(a[2,1,] ~ timevec))[[2]]
## [1] -0.4453307
coef(lm(a[1,2,] ~ timevec))[[2]]
## [1] 0.5682025
coef(lm(a[2,2,] ~ timevec))[[2]]
## [1] 0.2338086
coef(lm(a[1,3,] ~ timevec))[[2]]
## [1] -0.255538
coef(lm(a[2,3,] ~ timevec))[[2]]
## [1] 0.254682

2) Alternately, we can remove lm entirely by using the formula for the slope coefficient like this:

m <- t(matrix(a,,dim(a)[3]))
array(cov(m, timevec) / var(timevec), dims[1:2])
##            [,1]      [,2]      [,3]
## [1,]  0.2636792 0.5682025 -0.255538
## [2,] -0.4453307 0.2338086  0.254682

Performance

We see that the single lm runs about 8x faster than apply and eliminating lm runs about 230x times faster than apply. Because the apply is brutally slow on my laptop I only used 3 replications but if you have a faster machine or more patience you can increase it. The main conclusions are unlikely to change much though.

library(microbenchmark)

set.seed(5)

dims <- c(54,71,360)
a <- array(rnorm(prod(dims)), dims)
timevec <- seq_len(dim(a)[3])


microbenchmark(times = 3L,
  apply = apply(a, 1:2, function(x) coef(lm(x ~ timevec))[2]),
  lm = {  m <- t(matrix(a,,dim(a)[3]))
          array(coef(lm(m ~ timevec))[2, ], dim(a)[1:2]) 
  },
  cov =  {  m <- t(matrix(a,,dim(a)[3]))
            array(cov(m, timevec) / var(timevec), dims[1:2])
  })

giving:

Unit: milliseconds
  expr        min         lq        mean     median         uq        max neval cld
 apply 13446.7953 13523.6016 13605.25037 13600.4079 13684.4779 13768.5479     3   b
    lm   264.5883   275.7611   476.82077   286.9338   582.9370   878.9402     3  a 
   cov    56.9120    57.8830    58.71573    58.8540    59.6176    60.3812     3  a 

Note

Test data.

set.seed(5)

# dims <- c(54,71,360)
dims <- 2:4
a <- array(rnorm(prod(dims)), dims)
timevec <- seq_len(dim(a)[3])

Upvotes: 2

Rui Barradas
Rui Barradas

Reputation: 76495

There is a anonymous function missing in the question's regression code. Here I will use the new lambdas, introduced in R 4.1.0.
I also use the recommended extractor coef.

set.seed(5)
my_array <- array(sample(rnorm(100), 600, replace=T), dim= c(54,71,360))

time_vec <- 1:360
trend_mat <- apply(my_array, 1:2, \(x) coef(lm(x ~ time_vec))[2])

Upvotes: 1

Related Questions