Omry Atia
Omry Atia

Reputation: 2443

aggregate matrix rows based on groups given in a list

I have the following matrix M

structure(c(0, 0.2, 0.4, 0.6, 0.8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.2, 0.4, 0.6, 0.8, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 423, 176, 135, 
30, 4), .Dim = c(5L, 19L), .Dimnames = list(NULL, c("pregnant_min", 
"glucose_min", "blood_min", "skin_min", "INSULIN_min", "MASS_min", 
"DIAB_min", "AGE_min", "CLASS_min", "pregnant_max", "glucose_max", 
"blood_max", "skin_max", "INSULIN_max", "MASS_max", "DIAB_max", 
"AGE_max", "CLASS_max", "NumOfObser")))

and a list L:

L = list(1L, 2L, 3:5)

The elements of the list indicate which rows of M should be grouped together. The first and second lines should be groups of their own. The 3-5 lines should form a group in the following sense:

Rows 3-5 of M should be replaced with one row, whose min of every value should be the min of the minimum of rows 3-5, the max should be the maximum, and its number of observations should be the sum.

So the output should look like this:

structure(c(0, 0.2, 0.4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.2, 0.4, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 423, 
176, 169), .Dim = c(3L, 19L), .Dimnames = list(NULL, c("pregnant_min", 
"glucose_min", "blood_min", "skin_min", "INSULIN_min", "MASS_min", 
"DIAB_min", "AGE_min", "CLASS_min", "pregnant_max", "glucose_max", 
"blood_max", "skin_max", "INSULIN_max", "MASS_max", "DIAB_max", 
"AGE_max", "CLASS_max", "NumOfObser")))

The elements of the list L can be composed of any combination of 1-5, corresponding to the number of rows of M.

How can I achieve this output in the general case? So far I have looped through the elements of L, but I am pretty sure there is a more neat/efficient way of doing this.

Upvotes: 0

Views: 234

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269852

We assume that we should use min or max on a particular column if the column name ends in _min or _max and sum if the column name contains Num.

Now, create a grouping variable g. We have used a general expression below but if we knew that unlist(L) equalled 1:nrow(M), as it does here, then values would equal 1:nrow(M) and the expression for g reduces to the simpler expression g <- stack(setNames(L, seq_along(L))$ind.

Also define fn to be a character vector of function names ("min", "max", "sum") to apply. Then iterate simultaneously over the columns and fn using mapply and for each pair use tapply to process it as shown.

This should generalize as long as the same coding in the column names is used for min, max and sum. The order of the input columns is arbitrary and will be maintained on output. For example, if the column order were pregnant_min, pregnant_max, glucose_min, glucose_max, etc. it would still work and return the columns in that order.

No packages are used.

g <- with(stack(setNames(L, seq_along(L))), ind[order(values)])
fn <- sub(".*_", "", colnames(M))
fn[grepl("Num", colnames(M))] <- "sum"
mapply(function(col, fn) tapply(col, g, fn), as.list(as.data.frame(M)), fn)

giving:

  pregnant_min glucose_min blood_min skin_min INSULIN_min MASS_min DIAB_min
1          0.0           0         0        0           0        0        0
2          0.2           0         0        0           0        0        0
3          0.4           0         0        0           0        0        0
  AGE_min CLASS_min pregnant_max glucose_max blood_max skin_max INSULIN_max
1       0         0          0.2           1         1        1           1
2       0         0          0.4           1         1        1           1
3       0         0          1.0           1         1        1           1
  MASS_max DIAB_max AGE_max CLASS_max NumOfObser
1        1        1       1         1        423
2        1        1       1         1        176
3        1        1       1         1        169

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389135

library(matrixStats)
#Get index of "min" cols
min_col <- grep("min", colnames(M))
#Get index of "max" cols
max_col <- grep("max", colnames(M))

setNames(do.call("rbind.data.frame", lapply(L, function(x) {
         if (length(x) > 1)
           c(colMins(M[x, min_col]), colMaxs(M[x, max_col]), sum(M[x, "NumOfObser"]))
         else
           M[x, ]
})), colnames(M))


#  pregnant_min glucose_min blood_min skin_min INSULIN_min MASS_min DIAB_min AGE_min
#1          0.0           0         0        0           0        0        0       0
#2          0.2           0         0        0           0        0        0       0
#3          0.4           0         0        0           0        0        0       0

#  CLASS_min pregnant_max glucose_max blood_max skin_max INSULIN_max MASS_max DIAB_max
#1         0          0.2           1         1        1           1        1        1
#2         0          0.4           1         1        1           1        1        1
#3         0          1.0           1         1        1           1        1        1

#  AGE_max CLASS_max NumOfObser
#1       1         1        423
#2       1         1        176
#3       1         1        169

We first find out the index of "max" and "min" cols and store them in separate vector. For every list element in L we check it's length and if it is 1 then we return the row as it is since max and min on 1 row matrix would give us the same row. If the length is greater than 1 then we take minimum of each min_col and maximum in each max_col take sum of "NumOfObser" column and return one row for that group. Finally we rbind all these rows and giving it's original meaningful names using setNames.


I have used colMins and colMaxs functions because it is easy to understand and makes the operation simple. If anybody is interested only in base R answer they can use sapply to get column-wise max and min

setNames(do.call("rbind.data.frame", lapply(L, function(x) {
    if (length(x) > 1)
      c(sapply(data.frame(M[x, min_col]), min), 
        sapply(data.frame(M[x, max_col]), max), 
        sum(M[x, "NumOfObser"]))
     else
       M[x, ]
 })), colnames(M))

Upvotes: 3

Related Questions