Reputation: 617
I have an array of the dimensions c(54,71,360)
which contains climatalogical data. The first two dimensions describe the grid of the region, while the third one serves as time dimension. So in this case, there are 360 time steps (months).
Here is code to produce a sample array:
set.seed(5)
my_array <- array(sample(rnorm(100), 600, replace=T), dim= c(54,71,360))
Now I would like to calculate the trend of each grid cell. The trend is equal to the slope of the linear regression equation. This is why the calculation of the linear regression of every grid cell with the time needs to be performed. And this is exactly what I am struggeling with.
To clearly show what I wish to do, here is an example with one grid cell, which is taken from the array as a vector of the length 360:
grid_cell <- my_array[1,1,]
The linear regression of this vector with the time needs to be calculated. For that purpose, we create a simple time vector:
time_vec <- 1:360
Since I am only interested at the slope coefficient, it can be done this way:
trend <- lm(grid_cell ~ time_vec)$coefficients[2]
This leads to a value of 1.347029e-05
in this case.
I would like to do this for every grid cell of the array, so that the output is a matrix of the dimensions c(54,71)
, meaning one trend value for each grid cell.
I tried the following, which did not work:
trend_mat <- apply(my_array, 1:2, lm(my_array ~ time_vec)$coefficients[2])
I receive the error message:Error in model.frame.default: variable lengths differ.
This is kind of surprising, since both, the third dimension of the array and the time_vec
are both of the length 360.
Anybody with an idea how to achieve this?
Of course I am also open for other solutions which may work totally differently, as long as they lead to the same result.
Upvotes: 1
Views: 520
Reputation: 269854
The problems with the code in the question are that
apply
should be a function and the question's code provides an expression instead of a function.lm
many times. We show how to do it applying lm
only once and in the second alternative we don't use lm
at all. this gives one and two order of magnitude speedups as shown in the Performance section below.It is easier to illustrate if we use smaller data as shown in the Note at the end. To use it on your example just replace dims
with the line shown in the commented out line in the Note.
1) First we reshape the array into a matrix, perform lm and then reshape it back. This invokes lm once rather than invoking it prod(dims[1:2])
times.
m <- t(matrix(a,,dim(a)[3]))
array(coef(lm(m ~ timevec))[2, ], dim(a)[1:2])
## [,1] [,2] [,3]
## [1,] 0.2636792 0.5682025 -0.255538
## [2,] -0.4453307 0.2338086 0.254682
# check
coef(lm(a[1,1,] ~ timevec))[[2]]
## [1] 0.2636792
coef(lm(a[2,1,] ~ timevec))[[2]]
## [1] -0.4453307
coef(lm(a[1,2,] ~ timevec))[[2]]
## [1] 0.5682025
coef(lm(a[2,2,] ~ timevec))[[2]]
## [1] 0.2338086
coef(lm(a[1,3,] ~ timevec))[[2]]
## [1] -0.255538
coef(lm(a[2,3,] ~ timevec))[[2]]
## [1] 0.254682
2) Alternately, we can remove lm
entirely by using the formula for the slope coefficient like this:
m <- t(matrix(a,,dim(a)[3]))
array(cov(m, timevec) / var(timevec), dims[1:2])
## [,1] [,2] [,3]
## [1,] 0.2636792 0.5682025 -0.255538
## [2,] -0.4453307 0.2338086 0.254682
We see that the single lm runs about 8x faster than apply and eliminating lm runs about 230x times faster than apply
. Because the apply
is brutally slow on my laptop I only used 3 replications but if you have a faster machine or more patience you can increase it. The main conclusions are unlikely to change much though.
library(microbenchmark)
set.seed(5)
dims <- c(54,71,360)
a <- array(rnorm(prod(dims)), dims)
timevec <- seq_len(dim(a)[3])
microbenchmark(times = 3L,
apply = apply(a, 1:2, function(x) coef(lm(x ~ timevec))[2]),
lm = { m <- t(matrix(a,,dim(a)[3]))
array(coef(lm(m ~ timevec))[2, ], dim(a)[1:2])
},
cov = { m <- t(matrix(a,,dim(a)[3]))
array(cov(m, timevec) / var(timevec), dims[1:2])
})
giving:
Unit: milliseconds
expr min lq mean median uq max neval cld
apply 13446.7953 13523.6016 13605.25037 13600.4079 13684.4779 13768.5479 3 b
lm 264.5883 275.7611 476.82077 286.9338 582.9370 878.9402 3 a
cov 56.9120 57.8830 58.71573 58.8540 59.6176 60.3812 3 a
Test data.
set.seed(5)
# dims <- c(54,71,360)
dims <- 2:4
a <- array(rnorm(prod(dims)), dims)
timevec <- seq_len(dim(a)[3])
Upvotes: 2
Reputation: 76495
There is a anonymous function missing in the question's regression code. Here I will use the new lambdas, introduced in R 4.1.0.
I also use the recommended extractor coef
.
set.seed(5)
my_array <- array(sample(rnorm(100), 600, replace=T), dim= c(54,71,360))
time_vec <- 1:360
trend_mat <- apply(my_array, 1:2, \(x) coef(lm(x ~ time_vec))[2])
Upvotes: 1