Reputation: 7464
I want to pack data into bins of the same size, where each bin is average value of all the values of cases in this bin. This is straightforward with 1-dimensional data sorted into k=10
bins:
library(magrittr)
sample(1000) %>% sort %>% tapply(cut(., 10), mean)
What if I had 2-dimensional data, i.e. each bin would include values from a 2-dimensional space sorted by both x
and y
values. What about 3D...? Could you suggest a general approach?
Let's say a 2D data with two variables with levels named here letters
and numbers
for simplicity. This data, when aggregated, has only ten values like on the diagram below:
1 2 3 4 5
a x . . x .
b . x . . .
c . x x x .
d . x x . x
e x . . . .
so every x
is a group average of k
observations - in [a, 1]
we have k
values that are both a
and 1
and the values are averaged, so it is a tuple (mean(a), mean(1))
.
To give another example: I would like to get as an output a matrix with something similar to hexbin
plot, but with mean values rather then counts in cells.
Upvotes: 2
Views: 683
Reputation: 160407
the source is an array of some dimensions, meaning we have a datum for each cell; this has not been tested on sparse matrices, though I don't see why it would not work in theory; and
when you say bins of the same size, I'm assuming you will accept "approximate" when the dimensions of the original matrix do not divide evenly.
Some perks of this implementation, as obscure or difficult-to-read as it may be:
I think it will work with arbitrary dimensionality of the source array; though I've tested it with some cases, I am not 100% confident that it is fool-proof; and
it will allow arbitrary summary functions, defaulting to mean
; this might be useful when reducing a matrix but still desiring metrics of its central tendency (e.g., mean
, median
) and dispersion (e.g., sd
, var
, range
, IQR
). Obviously any number of functions can be used here, so long as they accept as their first argument an array of arbitrary dimensionality and nothing else.
reduceMatrix <- function(x, newdim, func = mean) {
if (length(newdim) == 1)
newdim <- rep(newdim, length(dim(x)))
if (length(dim(x)) != length(newdim))
stop('newdim must be of length 1 or the same length as dimensions of x')
allCuts <- lapply(1:length(newdim), function(d) {
tmp <- round(sapply(dim(x)[d], function(y) seq(1, 1 + y, len = 1 + newdim[d])),
digits = 0)
mapply(seq, head(tmp, n = -1), tmp[-1] - 1, SIMPLIFY = FALSE)
})
newIndices <- lapply(newdim, function(d) seq(1, d))
eg <- do.call(expand.grid, newIndices)
f <- function(m, cuts, ...) func(do.call(`[`, c(list(m), mapply(`[`, cuts, list(...)))))
ret <- do.call(mapply, c(list(FUN=f), list(list(x)), list(list(allCuts)), eg))
array(ret, dim = newdim)
}
dim2 <- c(8,8)
mtx2 <- array(1:prod(dim2), dim = dim2)
mtx2
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1 9 17 25 33 41 49 57
## [2,] 2 10 18 26 34 42 50 58
## [3,] 3 11 19 27 35 43 51 59
## [4,] 4 12 20 28 36 44 52 60
## [5,] 5 13 21 29 37 45 53 61
## [6,] 6 14 22 30 38 46 54 62
## [7,] 7 15 23 31 39 47 55 63
## [8,] 8 16 24 32 40 48 56 64
reduceMatrix(mtx2, newdim = 2)
## [,1] [,2]
## [1,] 14.5 46.5
## [2,] 18.5 50.5
matrix(c(mean(mtx2[1:4,1:4]), mean(mtx2[1:4,5:8]),
mean(mtx2[5:8,1:4]), mean(mtx2[5:8,5:8])),
nrow = 2, byrow = TRUE)
## [,1] [,2]
## [1,] 14.5 46.5
## [2,] 18.5 50.5
dim3 <- c(4,4,16)
mtx3 <- array(1:prod(dim3), dim = dim3)
mtx3[,,1:2]
## , , 1
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
## , , 2
## [,1] [,2] [,3] [,4]
## [1,] 17 21 25 29
## [2,] 18 22 26 30
## [3,] 19 23 27 31
## [4,] 20 24 28 32
reduceMatrix(mtx3, newdim = c(2, 4, 2))
## , , 1
## [,1] [,2] [,3] [,4]
## [1,] 57.5 61.5 65.5 69.5
## [2,] 59.5 63.5 67.5 71.5
## , , 2
## [,1] [,2] [,3] [,4]
## [1,] 185.5 189.5 193.5 197.5
## [2,] 187.5 191.5 195.5 199.5
mean(mtx3[1:2, 4, 1:8])
## [1] 69.5
This can even reduce the depth from a 3D to a 2D (in this case):
reduceMatrix(mtx3, newdim = c(2, 4, 1))
## , , 1
## [,1] [,2] [,3] [,4]
## [1,] 121.5 125.5 129.5 133.5
## [2,] 123.5 127.5 131.5 135.5
reduceMatrix(mtx3, newdim = c(2, 4, 1), func = min)
## , , 1
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 3 7 11 15
reduceMatrix(mtx3, newdim = c(2, 4, 1), func = sd)
## , , 1
## [,1] [,2] [,3] [,4]
## [1,] 74.93825 74.93825 74.93825 74.93825
## [2,] 74.93825 74.93825 74.93825 74.93825
In this case, the results from sd
are not surprising given the data's sequential nature ...
hugedim <- c(100, 100, 100, 100)
## this may take a few seconds ...
hugemtx <- array(rnorm(prod(hugedim)), dim = hugedim)
reduceMatrix(hugemtx, newdim = c(4, 4, 4, 4))[,,1,1]
## [,1] [,2] [,3] [,4]
## [1,] -0.001348164 -0.0007825642 -0.000369729 -0.0021077956
## [2,] 0.001736173 0.0023445047 -0.001259591 -0.0024143239
## [3,] -0.001820090 -0.0011529691 -0.001941874 -0.0005125225
## [4,] 0.002675196 -0.0009186329 -0.002009339 -0.0009181220
reduceMatrix(hugemtx, newdim = c(4, 4, 2, 1), func = sd)
## , , 1, 1
## [,1] [,2] [,3] [,4]
## [1,] 1.000269 1.000427 1.000454 0.9999282
## [2,] 1.000533 0.999885 1.000669 1.0003171
## [3,] 1.000580 1.000232 1.000167 0.9999460
## [4,] 1.000744 1.000150 1.000510 0.9994936
## , , 2, 1
## [,1] [,2] [,3] [,4]
## [1,] 0.9997954 0.9991735 1.000129 0.9996085
## [2,] 0.9999844 1.0002422 1.000016 0.9999593
## [3,] 1.0002473 1.0001445 1.000639 0.9998599
## [4,] 1.0005914 0.9993628 0.999104 1.0002434
(I realize on re-reading your problem statement for the seventeenth time that your output is not what I have been envisioning. I wonder if this can be adapted ...)
Upvotes: 1
Reputation: 1985
Hmmm... seems like you want mutually exclusive clusters that are rank-ordered.
I'm sure it can be done for 2D & 3D, but I'm not sure that the solutions would be necessarily unique unless there was a hierarchy to the variables.
As an example of the uniqueness problem, consider a 2D space with uniformly distributed data centered at the origin with k = 2. Any pair of line that equally divides the data into 4 bins (i.e. any two lines through the origin that are orthogonal) would create a 4 bin case, but the rank ordering is poorly defined unless we have some weighting for each of the two variables.
However, if you have the relative weighting (or importance) for each variable... any projection of the 2D space into 1D space could be used to define the rank and an optimal solution can be defined.
Upvotes: 0