Patrick
Patrick

Reputation: 32234

Summarise multi-dimensional array by month

I have a multi-dimensional array where the 3rd dimension represents time. For purposes of this question, let's use the ozone dataset from the plyr package:

> str(ozone)
 num [1:24, 1:24, 1:72] 260 258 258 254 252 252 250 248 248 248 ...
 - attr(*, "dimnames")=List of 3
  ..$ lat : chr [1:24] "-21.2" "-18.7" "-16.2" "-13.7" ...
  ..$ long: chr [1:24] "-113.8" "-111.3" "-108.8" "-106.3" ...
  ..$ time: chr [1:72] "1" "2" "3" "4" ...

From the documentation:

The data are monthly ozone averages on a very coarse 24 by 24 grid covering Central America, from Jan 1995 to Dec 2000. The data is stored in a 3d area with the first two dimensions representing latitude and longitude, and the third representing time.

What I would like to do is to create the monthly average for each of the lat/long cells. I can do this for a single lat/long combination with tapply like so:

> tapply(ozone[1,1,], rep(1:12, 6), mean)
       1        2        3        4        5        6        7        8        9       10       11       12 
264.6667 257.6667 255.0000 251.3333 257.6667 265.0000 274.0000 275.3333 277.0000 285.0000 283.0000 273.3333

but I am stumped on doing this over the entire array at once. apply will let me select the dimensions to operate over (MARGIN), tapply will let me use a factor to select slices (INDEX), but I need both.

I am open to suggestions but prefer to work with arrays rather than data frames due to the size and complexity of the data.


Two excellent answers below from GKi and G. Grothendieck, many thanks to both. I have put them through microbenchmark and these are the results:

> microbenchmark(
    GKi1 = apply(array(ozone, c(dim(ozone)[1:2], 12, dim(ozone)[3]/12), c(dimnames(ozone)[1:2], list(month=1:12, year=1995:2000))), 1:3, mean),
    GKi2 = simplify2array(lapply(split(dimnames(plyr::ozone)[[3]], 1:12), \(x) apply(plyr::ozone[,,x], 1:2, mean))),
    Grothendieck = apply(ozone, 1:2, month_mean),
  times=100)

Unit: milliseconds
         expr      min       lq     mean   median       uq       max neval
         GKi1 21.67954 22.99889 25.73446 23.89190 27.26137  46.26843   100
         GKi2 20.90931 22.90361 26.64572 23.88572 30.76128  45.16404   100
 Grothendieck 40.98800 43.26854 49.51313 44.73214 52.28114 266.12759   100

Upvotes: 2

Views: 73

Answers (2)

GKi
GKi

Reputation: 39667

You can divide the time dimension in month and year and use then apply.

x <- plyr::ozone
x <- array(x, c(dim(x)[1:2], 12, dim(x)[3]/12),
           c(dimnames(x)[1:2], list(month=1:12, year=1995:2000)))
#dim(x) <- c(dim(x)[1:2], 12, dim(x)[3]/12) #Alternative without names
. <- apply(x, 1:3, mean)
.[1,1,]
#       1        2        3        4        5        6        7        8 
#264.6667 257.6667 255.0000 251.3333 257.6667 265.0000 274.0000 275.3333 
#       9       10       11       12 
#277.0000 285.0000 283.0000 273.3333 

Another option can be.

. <- simplify2array(lapply(split(dimnames(plyr::ozone)[[3]], 1:12), \(x)
                    apply(plyr::ozone[,,x], 1:2, mean)))
.[1,1,]
#       1        2        3        4        5        6        7        8 
#264.6667 257.6667 255.0000 251.3333 257.6667 265.0000 274.0000 275.3333 
#       9       10       11       12 
#277.0000 285.0000 283.0000 273.3333 

Or using tapply inside apply (based on the grat answer of @g-grothendieck)

. <- apply(plyr::ozone, 1:2, tapply, rep(1:12, 6), mean)
.[,1,1]
#aperm(., c(2,3,1))[1,1,] #Alternative
#       1        2        3        4        5        6        7        8 
#264.6667 257.6667 255.0000 251.3333 257.6667 265.0000 274.0000 275.3333 
#       9       10       11       12 
#277.0000 285.0000 283.0000 273.3333 

Using by in apply.

. <- apply(plyr::ozone, 1:2, by, rep(1:12, 6), mean)
.[,1,1]
#       1        2        3        4        5        6        7        8 
#264.6667 257.6667 255.0000 251.3333 257.6667 265.0000 274.0000 275.3333 
#       9       10       11       12 
#277.0000 285.0000 283.0000 273.3333 

If speed matters rowMeans could be used.

. <- rowMeans(`dim<-`(plyr::ozone, c(dim(plyr::ozone)[1:2], 12,
  dim(plyr::ozone)[3]/12)), dims=3)
.[1,1,]
# [1] 264.6667 257.6667 255.0000 251.3333 257.6667 265.0000 274.0000 275.3333
# [9] 277.0000 285.0000 283.0000 273.3333

Benchmark:

set.seed(42)
a <- array(runif(24 * 24 * 72), c(24, 24, 72))

bench::mark(check = FALSE, #Some have attr and have not the same order
  applyDim = apply(`dim<-`(a, c(dim(a)[1:2], 12, dim(a)[3]/12)), 1:3, mean),
  split = simplify2array(lapply(split(seq_len(dim(a)[3]), 1:12), \(x)
                    apply(a[,,x], 1:2, mean))),
  applyTapply = apply(a, 1:2, tapply, rep_len(1:12, dim(a)[3]), mean),
  applyBy = apply(a, 1:2, by, rep_len(1:12, dim(a)[3]), mean),
  rowMeans = rowMeans(`dim<-`(a, c(dim(a)[1:2], 12, dim(a)[3]/12)), dims=3)
)
#  expression       min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 applyDim      25.5ms   29.2ms     33.4   864.33KB     41.2    17    21
#2 split         30.6ms   33.6ms     30.1   984.13KB     35.7    16    19
#3 applyTapply   36.2ms   38.8ms     22.6     2.52MB     35.9    12    19
#4 applyBy      177.1ms  179.1ms      5.56    2.71MB     35.2     3    19
#5 rowMeans     130.9µs  155.1µs   5149.    378.09KB     38.0  2575    19

Upvotes: 5

G. Grothendieck
G. Grothendieck

Reputation: 269654

Using the array defined reproducibly in the Note at the end we have the following using apply. The precise output was not defined but if you wanted a different order of dimensions then use aperm.

month_mean <- function(x) c(tapply(x, rep(1:12, each = 6), mean))
aa <- apply(a, 1:2, month_mean)

Now check the result

# check
all.equal(aa[, 4, 9], month_mean(a[4, 9, ]))
## [1] TRUE

# another check
for(i in 1:24) for(j in 1:24) {
  check <- all.equal(aa[, i, j], month_mean(a[i, j, ]))
  if (!check) stop("not equal")
}

Note

set.seed(123)
a <- array(runif(24 * 24 * 72), c(24, 24, 72))

Upvotes: 2

Related Questions