Reputation: 32234
I have a multi-dimensional array where the 3rd dimension represents time. For purposes of this question, let's use the ozone
dataset from the plyr
package:
> str(ozone)
num [1:24, 1:24, 1:72] 260 258 258 254 252 252 250 248 248 248 ...
- attr(*, "dimnames")=List of 3
..$ lat : chr [1:24] "-21.2" "-18.7" "-16.2" "-13.7" ...
..$ long: chr [1:24] "-113.8" "-111.3" "-108.8" "-106.3" ...
..$ time: chr [1:72] "1" "2" "3" "4" ...
From the documentation:
The data are monthly ozone averages on a very coarse 24 by 24 grid covering Central America, from Jan 1995 to Dec 2000. The data is stored in a 3d area with the first two dimensions representing latitude and longitude, and the third representing time.
What I would like to do is to create the monthly average for each of the lat/long cells. I can do this for a single lat/long combination with tapply
like so:
> tapply(ozone[1,1,], rep(1:12, 6), mean)
1 2 3 4 5 6 7 8 9 10 11 12
264.6667 257.6667 255.0000 251.3333 257.6667 265.0000 274.0000 275.3333 277.0000 285.0000 283.0000 273.3333
but I am stumped on doing this over the entire array at once. apply
will let me select the dimensions to operate over (MARGIN
), tapply
will let me use a factor to select slices (INDEX
), but I need both.
I am open to suggestions but prefer to work with arrays rather than data frames due to the size and complexity of the data.
Two excellent answers below from GKi and G. Grothendieck, many thanks to both. I have put them through microbenchmark
and these are the results:
> microbenchmark(
GKi1 = apply(array(ozone, c(dim(ozone)[1:2], 12, dim(ozone)[3]/12), c(dimnames(ozone)[1:2], list(month=1:12, year=1995:2000))), 1:3, mean),
GKi2 = simplify2array(lapply(split(dimnames(plyr::ozone)[[3]], 1:12), \(x) apply(plyr::ozone[,,x], 1:2, mean))),
Grothendieck = apply(ozone, 1:2, month_mean),
times=100)
Unit: milliseconds
expr min lq mean median uq max neval
GKi1 21.67954 22.99889 25.73446 23.89190 27.26137 46.26843 100
GKi2 20.90931 22.90361 26.64572 23.88572 30.76128 45.16404 100
Grothendieck 40.98800 43.26854 49.51313 44.73214 52.28114 266.12759 100
Upvotes: 2
Views: 73
Reputation: 39667
You can divide the time dimension in month and year and use then apply
.
x <- plyr::ozone
x <- array(x, c(dim(x)[1:2], 12, dim(x)[3]/12),
c(dimnames(x)[1:2], list(month=1:12, year=1995:2000)))
#dim(x) <- c(dim(x)[1:2], 12, dim(x)[3]/12) #Alternative without names
. <- apply(x, 1:3, mean)
.[1,1,]
# 1 2 3 4 5 6 7 8
#264.6667 257.6667 255.0000 251.3333 257.6667 265.0000 274.0000 275.3333
# 9 10 11 12
#277.0000 285.0000 283.0000 273.3333
Another option can be.
. <- simplify2array(lapply(split(dimnames(plyr::ozone)[[3]], 1:12), \(x)
apply(plyr::ozone[,,x], 1:2, mean)))
.[1,1,]
# 1 2 3 4 5 6 7 8
#264.6667 257.6667 255.0000 251.3333 257.6667 265.0000 274.0000 275.3333
# 9 10 11 12
#277.0000 285.0000 283.0000 273.3333
Or using tapply
inside apply
(based on the grat answer of @g-grothendieck)
. <- apply(plyr::ozone, 1:2, tapply, rep(1:12, 6), mean)
.[,1,1]
#aperm(., c(2,3,1))[1,1,] #Alternative
# 1 2 3 4 5 6 7 8
#264.6667 257.6667 255.0000 251.3333 257.6667 265.0000 274.0000 275.3333
# 9 10 11 12
#277.0000 285.0000 283.0000 273.3333
Using by
in apply
.
. <- apply(plyr::ozone, 1:2, by, rep(1:12, 6), mean)
.[,1,1]
# 1 2 3 4 5 6 7 8
#264.6667 257.6667 255.0000 251.3333 257.6667 265.0000 274.0000 275.3333
# 9 10 11 12
#277.0000 285.0000 283.0000 273.3333
If speed matters rowMeans
could be used.
. <- rowMeans(`dim<-`(plyr::ozone, c(dim(plyr::ozone)[1:2], 12,
dim(plyr::ozone)[3]/12)), dims=3)
.[1,1,]
# [1] 264.6667 257.6667 255.0000 251.3333 257.6667 265.0000 274.0000 275.3333
# [9] 277.0000 285.0000 283.0000 273.3333
Benchmark:
set.seed(42)
a <- array(runif(24 * 24 * 72), c(24, 24, 72))
bench::mark(check = FALSE, #Some have attr and have not the same order
applyDim = apply(`dim<-`(a, c(dim(a)[1:2], 12, dim(a)[3]/12)), 1:3, mean),
split = simplify2array(lapply(split(seq_len(dim(a)[3]), 1:12), \(x)
apply(a[,,x], 1:2, mean))),
applyTapply = apply(a, 1:2, tapply, rep_len(1:12, dim(a)[3]), mean),
applyBy = apply(a, 1:2, by, rep_len(1:12, dim(a)[3]), mean),
rowMeans = rowMeans(`dim<-`(a, c(dim(a)[1:2], 12, dim(a)[3]/12)), dims=3)
)
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
#1 applyDim 25.5ms 29.2ms 33.4 864.33KB 41.2 17 21
#2 split 30.6ms 33.6ms 30.1 984.13KB 35.7 16 19
#3 applyTapply 36.2ms 38.8ms 22.6 2.52MB 35.9 12 19
#4 applyBy 177.1ms 179.1ms 5.56 2.71MB 35.2 3 19
#5 rowMeans 130.9µs 155.1µs 5149. 378.09KB 38.0 2575 19
Upvotes: 5
Reputation: 269654
Using the array defined reproducibly in the Note at the end we have the following using apply
. The precise output was not defined but if you wanted a different order of dimensions then use aperm
.
month_mean <- function(x) c(tapply(x, rep(1:12, each = 6), mean))
aa <- apply(a, 1:2, month_mean)
Now check the result
# check
all.equal(aa[, 4, 9], month_mean(a[4, 9, ]))
## [1] TRUE
# another check
for(i in 1:24) for(j in 1:24) {
check <- all.equal(aa[, i, j], month_mean(a[i, j, ]))
if (!check) stop("not equal")
}
set.seed(123)
a <- array(runif(24 * 24 * 72), c(24, 24, 72))
Upvotes: 2