Tim
Tim

Reputation: 7464

Packing the 2D data into same size bins

I want to pack data into bins of the same size, where each bin is average value of all the values of cases in this bin. This is straightforward with 1-dimensional data sorted into k=10 bins:

library(magrittr)

sample(1000) %>% sort %>% tapply(cut(., 10), mean)

What if I had 2-dimensional data, i.e. each bin would include values from a 2-dimensional space sorted by both x and y values. What about 3D...? Could you suggest a general approach?

Let's say a 2D data with two variables with levels named here letters and numbers for simplicity. This data, when aggregated, has only ten values like on the diagram below:

  1 2 3 4 5
a x . . x .
b . x . . .
c . x x x .
d . x x . x
e x . . . .

so every x is a group average of k observations - in [a, 1] we have k values that are both a and 1 and the values are averaged, so it is a tuple (mean(a), mean(1)).

To give another example: I would like to get as an output a matrix with something similar to hexbin plot, but with mean values rather then counts in cells.

hexbin plot example

Upvotes: 2

Views: 683

Answers (2)

r2evans
r2evans

Reputation: 160407

Assumptions

  • the source is an array of some dimensions, meaning we have a datum for each cell; this has not been tested on sparse matrices, though I don't see why it would not work in theory; and

  • when you say bins of the same size, I'm assuming you will accept "approximate" when the dimensions of the original matrix do not divide evenly.

Some perks of this implementation, as obscure or difficult-to-read as it may be:

  • I think it will work with arbitrary dimensionality of the source array; though I've tested it with some cases, I am not 100% confident that it is fool-proof; and

  • it will allow arbitrary summary functions, defaulting to mean; this might be useful when reducing a matrix but still desiring metrics of its central tendency (e.g., mean, median) and dispersion (e.g., sd, var, range, IQR). Obviously any number of functions can be used here, so long as they accept as their first argument an array of arbitrary dimensionality and nothing else.

The Code

reduceMatrix <- function(x, newdim, func = mean) {
    if (length(newdim) == 1)
        newdim <- rep(newdim, length(dim(x)))
    if (length(dim(x)) != length(newdim))
        stop('newdim must be of length 1 or the same length as dimensions of x')

    allCuts <- lapply(1:length(newdim), function(d) {
        tmp <- round(sapply(dim(x)[d], function(y) seq(1, 1 + y, len = 1 + newdim[d])),
                     digits = 0)
        mapply(seq, head(tmp, n = -1), tmp[-1] - 1, SIMPLIFY = FALSE)
    })

    newIndices <- lapply(newdim, function(d) seq(1, d))
    eg <- do.call(expand.grid, newIndices)

    f <- function(m, cuts, ...) func(do.call(`[`, c(list(m), mapply(`[`, cuts, list(...)))))
    ret <- do.call(mapply, c(list(FUN=f), list(list(x)), list(list(allCuts)), eg))

    array(ret, dim = newdim)
}

Simple 2D Matrix

dim2 <- c(8,8)
mtx2 <- array(1:prod(dim2), dim = dim2)
mtx2
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    9   17   25   33   41   49   57
## [2,]    2   10   18   26   34   42   50   58
## [3,]    3   11   19   27   35   43   51   59
## [4,]    4   12   20   28   36   44   52   60
## [5,]    5   13   21   29   37   45   53   61
## [6,]    6   14   22   30   38   46   54   62
## [7,]    7   15   23   31   39   47   55   63
## [8,]    8   16   24   32   40   48   56   64
reduceMatrix(mtx2, newdim = 2)
##      [,1] [,2]
## [1,] 14.5 46.5
## [2,] 18.5 50.5
matrix(c(mean(mtx2[1:4,1:4]), mean(mtx2[1:4,5:8]),
         mean(mtx2[5:8,1:4]), mean(mtx2[5:8,5:8])),
       nrow = 2, byrow = TRUE)
##      [,1] [,2]
## [1,] 14.5 46.5
## [2,] 18.5 50.5

A Third Dimension

dim3 <- c(4,4,16)
mtx3 <- array(1:prod(dim3), dim = dim3)
mtx3[,,1:2]
## , , 1
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
## , , 2
##      [,1] [,2] [,3] [,4]
## [1,]   17   21   25   29
## [2,]   18   22   26   30
## [3,]   19   23   27   31
## [4,]   20   24   28   32
reduceMatrix(mtx3, newdim = c(2, 4, 2))
## , , 1
##      [,1] [,2] [,3] [,4]
## [1,] 57.5 61.5 65.5 69.5
## [2,] 59.5 63.5 67.5 71.5
## , , 2
##       [,1]  [,2]  [,3]  [,4]
## [1,] 185.5 189.5 193.5 197.5
## [2,] 187.5 191.5 195.5 199.5
mean(mtx3[1:2, 4, 1:8])
## [1] 69.5

This can even reduce the depth from a 3D to a 2D (in this case):

reduceMatrix(mtx3, newdim = c(2, 4, 1))
## , , 1
##       [,1]  [,2]  [,3]  [,4]
## [1,] 121.5 125.5 129.5 133.5
## [2,] 123.5 127.5 131.5 135.5

Arbitrary Functions

reduceMatrix(mtx3, newdim = c(2, 4, 1), func = min)
## , , 1
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    3    7   11   15
reduceMatrix(mtx3, newdim = c(2, 4, 1), func = sd)
## , , 1
##          [,1]     [,2]     [,3]     [,4]
## [1,] 74.93825 74.93825 74.93825 74.93825
## [2,] 74.93825 74.93825 74.93825 74.93825

In this case, the results from sd are not surprising given the data's sequential nature ...

Much Larger Matrix

hugedim <- c(100, 100, 100, 100)
## this may take a few seconds ...
hugemtx <- array(rnorm(prod(hugedim)), dim = hugedim)
reduceMatrix(hugemtx, newdim = c(4, 4, 4, 4))[,,1,1]
##              [,1]          [,2]         [,3]          [,4]
## [1,] -0.001348164 -0.0007825642 -0.000369729 -0.0021077956
## [2,]  0.001736173  0.0023445047 -0.001259591 -0.0024143239
## [3,] -0.001820090 -0.0011529691 -0.001941874 -0.0005125225
## [4,]  0.002675196 -0.0009186329 -0.002009339 -0.0009181220
reduceMatrix(hugemtx, newdim = c(4, 4, 2, 1), func = sd)
## , , 1, 1
##          [,1]     [,2]     [,3]      [,4]
## [1,] 1.000269 1.000427 1.000454 0.9999282
## [2,] 1.000533 0.999885 1.000669 1.0003171
## [3,] 1.000580 1.000232 1.000167 0.9999460
## [4,] 1.000744 1.000150 1.000510 0.9994936
## , , 2, 1
##           [,1]      [,2]     [,3]      [,4]
## [1,] 0.9997954 0.9991735 1.000129 0.9996085
## [2,] 0.9999844 1.0002422 1.000016 0.9999593
## [3,] 1.0002473 1.0001445 1.000639 0.9998599
## [4,] 1.0005914 0.9993628 0.999104 1.0002434

(I realize on re-reading your problem statement for the seventeenth time that your output is not what I have been envisioning. I wonder if this can be adapted ...)

Upvotes: 1

JJFord3
JJFord3

Reputation: 1985

Hmmm... seems like you want mutually exclusive clusters that are rank-ordered.

I'm sure it can be done for 2D & 3D, but I'm not sure that the solutions would be necessarily unique unless there was a hierarchy to the variables.

As an example of the uniqueness problem, consider a 2D space with uniformly distributed data centered at the origin with k = 2. Any pair of line that equally divides the data into 4 bins (i.e. any two lines through the origin that are orthogonal) would create a 4 bin case, but the rank ordering is poorly defined unless we have some weighting for each of the two variables.

However, if you have the relative weighting (or importance) for each variable... any projection of the 2D space into 1D space could be used to define the rank and an optimal solution can be defined.

Upvotes: 0

Related Questions