Packing the 2D data into same size bins

Question

I want to pack data into bins of the same size, where each bin is average value of all the values of cases in this bin. This is straightforward with 1-dimensional data sorted into k=10 bins:

library(magrittr)

sample(1000) %>% sort %>% tapply(cut(., 10), mean)

What if I had 2-dimensional data, i.e. each bin would include values from a 2-dimensional space sorted by both x and y values. What about 3D...? Could you suggest a general approach?

Let's say a 2D data with two variables with levels named here letters and numbers for simplicity. This data, when aggregated, has only ten values like on the diagram below:

  1 2 3 4 5
a x . . x .
b . x . . .
c . x x x .
d . x x . x
e x . . . .

so every x is a group average of k observations - in [a, 1] we have k values that are both a and 1 and the values are averaged, so it is a tuple (mean(a), mean(1)).

To give another example: I would like to get as an output a matrix with something similar to hexbin plot, but with mean values rather then counts in cells.

hexbin plot example

r2evans · Accepted Answer

Assumptions

the source is an array of some dimensions, meaning we have a datum for each cell; this has not been tested on sparse matrices, though I don't see why it would not work in theory; and
when you say bins of the same size, I'm assuming you will accept "approximate" when the dimensions of the original matrix do not divide evenly.

Some perks of this implementation, as obscure or difficult-to-read as it may be:

I think it will work with arbitrary dimensionality of the source array; though I've tested it with some cases, I am not 100% confident that it is fool-proof; and
it will allow arbitrary summary functions, defaulting to mean; this might be useful when reducing a matrix but still desiring metrics of its central tendency (e.g., mean, median) and dispersion (e.g., sd, var, range, IQR). Obviously any number of functions can be used here, so long as they accept as their first argument an array of arbitrary dimensionality and nothing else.

The Code

reduceMatrix <- function(x, newdim, func = mean) {
    if (length(newdim) == 1)
        newdim <- rep(newdim, length(dim(x)))
    if (length(dim(x)) != length(newdim))
        stop('newdim must be of length 1 or the same length as dimensions of x')

    allCuts <- lapply(1:length(newdim), function(d) {
        tmp <- round(sapply(dim(x)[d], function(y) seq(1, 1 + y, len = 1 + newdim[d])),
                     digits = 0)
        mapply(seq, head(tmp, n = -1), tmp[-1] - 1, SIMPLIFY = FALSE)
    })

    newIndices <- lapply(newdim, function(d) seq(1, d))
    eg <- do.call(expand.grid, newIndices)

    f <- function(m, cuts, ...) func(do.call(`[`, c(list(m), mapply(`[`, cuts, list(...)))))
    ret <- do.call(mapply, c(list(FUN=f), list(list(x)), list(list(allCuts)), eg))

    array(ret, dim = newdim)
}

Simple 2D Matrix

dim2 <- c(8,8)
mtx2 <- array(1:prod(dim2), dim = dim2)
mtx2
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    9   17   25   33   41   49   57
## [2,]    2   10   18   26   34   42   50   58
## [3,]    3   11   19   27   35   43   51   59
## [4,]    4   12   20   28   36   44   52   60
## [5,]    5   13   21   29   37   45   53   61
## [6,]    6   14   22   30   38   46   54   62
## [7,]    7   15   23   31   39   47   55   63
## [8,]    8   16   24   32   40   48   56   64
reduceMatrix(mtx2, newdim = 2)
##      [,1] [,2]
## [1,] 14.5 46.5
## [2,] 18.5 50.5
matrix(c(mean(mtx2[1:4,1:4]), mean(mtx2[1:4,5:8]),
         mean(mtx2[5:8,1:4]), mean(mtx2[5:8,5:8])),
       nrow = 2, byrow = TRUE)
##      [,1] [,2]
## [1,] 14.5 46.5
## [2,] 18.5 50.5

A Third Dimension

dim3 <- c(4,4,16)
mtx3 <- array(1:prod(dim3), dim = dim3)
mtx3[,,1:2]
## , , 1
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
## , , 2
##      [,1] [,2] [,3] [,4]
## [1,]   17   21   25   29
## [2,]   18   22   26   30
## [3,]   19   23   27   31
## [4,]   20   24   28   32
reduceMatrix(mtx3, newdim = c(2, 4, 2))
## , , 1
##      [,1] [,2] [,3] [,4]
## [1,] 57.5 61.5 65.5 69.5
## [2,] 59.5 63.5 67.5 71.5
## , , 2
##       [,1]  [,2]  [,3]  [,4]
## [1,] 185.5 189.5 193.5 197.5
## [2,] 187.5 191.5 195.5 199.5
mean(mtx3[1:2, 4, 1:8])
## [1] 69.5

This can even reduce the depth from a 3D to a 2D (in this case):

reduceMatrix(mtx3, newdim = c(2, 4, 1))
## , , 1
##       [,1]  [,2]  [,3]  [,4]
## [1,] 121.5 125.5 129.5 133.5
## [2,] 123.5 127.5 131.5 135.5

Arbitrary Functions

reduceMatrix(mtx3, newdim = c(2, 4, 1), func = min)
## , , 1
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    3    7   11   15
reduceMatrix(mtx3, newdim = c(2, 4, 1), func = sd)
## , , 1
##          [,1]     [,2]     [,3]     [,4]
## [1,] 74.93825 74.93825 74.93825 74.93825
## [2,] 74.93825 74.93825 74.93825 74.93825

In this case, the results from sd are not surprising given the data's sequential nature ...

Much Larger Matrix

hugedim <- c(100, 100, 100, 100)
## this may take a few seconds ...
hugemtx <- array(rnorm(prod(hugedim)), dim = hugedim)
reduceMatrix(hugemtx, newdim = c(4, 4, 4, 4))[,,1,1]
##              [,1]          [,2]         [,3]          [,4]
## [1,] -0.001348164 -0.0007825642 -0.000369729 -0.0021077956
## [2,]  0.001736173  0.0023445047 -0.001259591 -0.0024143239
## [3,] -0.001820090 -0.0011529691 -0.001941874 -0.0005125225
## [4,]  0.002675196 -0.0009186329 -0.002009339 -0.0009181220
reduceMatrix(hugemtx, newdim = c(4, 4, 2, 1), func = sd)
## , , 1, 1
##          [,1]     [,2]     [,3]      [,4]
## [1,] 1.000269 1.000427 1.000454 0.9999282
## [2,] 1.000533 0.999885 1.000669 1.0003171
## [3,] 1.000580 1.000232 1.000167 0.9999460
## [4,] 1.000744 1.000150 1.000510 0.9994936
## , , 2, 1
##           [,1]      [,2]     [,3]      [,4]
## [1,] 0.9997954 0.9991735 1.000129 0.9996085
## [2,] 0.9999844 1.0002422 1.000016 0.9999593
## [3,] 1.0002473 1.0001445 1.000639 0.9998599
## [4,] 1.0005914 0.9993628 0.999104 1.0002434

(I realize on re-reading your problem statement for the seventeenth time that your output is not what I have been envisioning. I wonder if this can be adapted ...)