aggregate matrix rows based on groups given in a list

Question

I have the following matrix M

structure(c(0, 0.2, 0.4, 0.6, 0.8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.2, 0.4, 0.6, 0.8, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 423, 176, 135, 
30, 4), .Dim = c(5L, 19L), .Dimnames = list(NULL, c("pregnant_min", 
"glucose_min", "blood_min", "skin_min", "INSULIN_min", "MASS_min", 
"DIAB_min", "AGE_min", "CLASS_min", "pregnant_max", "glucose_max", 
"blood_max", "skin_max", "INSULIN_max", "MASS_max", "DIAB_max", 
"AGE_max", "CLASS_max", "NumOfObser")))

and a list L:

L = list(1L, 2L, 3:5)

The elements of the list indicate which rows of M should be grouped together. The first and second lines should be groups of their own. The 3-5 lines should form a group in the following sense:

Rows 3-5 of M should be replaced with one row, whose min of every value should be the min of the minimum of rows 3-5, the max should be the maximum, and its number of observations should be the sum.

So the output should look like this:

structure(c(0, 0.2, 0.4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.2, 0.4, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 423, 
176, 169), .Dim = c(3L, 19L), .Dimnames = list(NULL, c("pregnant_min", 
"glucose_min", "blood_min", "skin_min", "INSULIN_min", "MASS_min", 
"DIAB_min", "AGE_min", "CLASS_min", "pregnant_max", "glucose_max", 
"blood_max", "skin_max", "INSULIN_max", "MASS_max", "DIAB_max", 
"AGE_max", "CLASS_max", "NumOfObser")))

The elements of the list L can be composed of any combination of 1-5, corresponding to the number of rows of M.

How can I achieve this output in the general case? So far I have looped through the elements of L, but I am pretty sure there is a more neat/efficient way of doing this.

Ronak Shah · Accepted Answer

library(matrixStats)
#Get index of "min" cols
min_col <- grep("min", colnames(M))
#Get index of "max" cols
max_col <- grep("max", colnames(M))

setNames(do.call("rbind.data.frame", lapply(L, function(x) {
         if (length(x) > 1)
           c(colMins(M[x, min_col]), colMaxs(M[x, max_col]), sum(M[x, "NumOfObser"]))
         else
           M[x, ]
})), colnames(M))


#  pregnant_min glucose_min blood_min skin_min INSULIN_min MASS_min DIAB_min AGE_min
#1          0.0           0         0        0           0        0        0       0
#2          0.2           0         0        0           0        0        0       0
#3          0.4           0         0        0           0        0        0       0

#  CLASS_min pregnant_max glucose_max blood_max skin_max INSULIN_max MASS_max DIAB_max
#1         0          0.2           1         1        1           1        1        1
#2         0          0.4           1         1        1           1        1        1
#3         0          1.0           1         1        1           1        1        1

#  AGE_max CLASS_max NumOfObser
#1       1         1        423
#2       1         1        176
#3       1         1        169

We first find out the index of "max" and "min" cols and store them in separate vector. For every list element in L we check it's length and if it is 1 then we return the row as it is since max and min on 1 row matrix would give us the same row. If the length is greater than 1 then we take minimum of each min_col and maximum in each max_col take sum of "NumOfObser" column and return one row for that group. Finally we rbind all these rows and giving it's original meaningful names using setNames.

I have used colMins and colMaxs functions because it is easy to understand and makes the operation simple. If anybody is interested only in base R answer they can use sapply to get column-wise max and min

setNames(do.call("rbind.data.frame", lapply(L, function(x) {
    if (length(x) > 1)
      c(sapply(data.frame(M[x, min_col]), min), 
        sapply(data.frame(M[x, max_col]), max), 
        sum(M[x, "NumOfObser"]))
     else
       M[x, ]
 })), colnames(M))

aggregate matrix rows based on groups given in a list

Answers (2)

Related Questions