Reputation: 2443
I have the following matrix M
structure(c(0, 0.2, 0.4, 0.6, 0.8, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.2, 0.4, 0.6, 0.8, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 423, 176, 135,
30, 4), .Dim = c(5L, 19L), .Dimnames = list(NULL, c("pregnant_min",
"glucose_min", "blood_min", "skin_min", "INSULIN_min", "MASS_min",
"DIAB_min", "AGE_min", "CLASS_min", "pregnant_max", "glucose_max",
"blood_max", "skin_max", "INSULIN_max", "MASS_max", "DIAB_max",
"AGE_max", "CLASS_max", "NumOfObser")))
and a list L
:
L = list(1L, 2L, 3:5)
The elements of the list indicate which rows of M
should be grouped together.
The first and second lines should be groups of their own. The 3-5 lines should form a group in the following sense:
Rows 3-5 of M should be replaced with one row, whose min
of every value should be the min
of the minimum of rows 3-5, the max
should be the maximum, and its number of observations should be the sum.
So the output should look like this:
structure(c(0, 0.2, 0.4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.2, 0.4, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 423,
176, 169), .Dim = c(3L, 19L), .Dimnames = list(NULL, c("pregnant_min",
"glucose_min", "blood_min", "skin_min", "INSULIN_min", "MASS_min",
"DIAB_min", "AGE_min", "CLASS_min", "pregnant_max", "glucose_max",
"blood_max", "skin_max", "INSULIN_max", "MASS_max", "DIAB_max",
"AGE_max", "CLASS_max", "NumOfObser")))
The elements of the list L
can be composed of any combination of 1-5, corresponding to the number of rows of M
.
How can I achieve this output in the general case? So far I have looped through the elements of L
, but I am pretty sure there is a more neat/efficient way of doing this.
Upvotes: 0
Views: 234
Reputation: 269852
We assume that we should use min
or max
on a particular column if the column name ends in _min
or _max
and sum
if the column name contains Num
.
Now, create a grouping variable g
. We have used a general expression below but if we knew that unlist(L)
equalled 1:nrow(M)
, as it does here, then values
would equal 1:nrow(M)
and the expression for g
reduces to the simpler expression g <- stack(setNames(L, seq_along(L))$ind
.
Also define fn
to be a character vector of function names ("min"
, "max"
, "sum"
) to apply. Then iterate simultaneously over the columns and fn
using mapply
and for each pair use tapply
to process it as shown.
This should generalize as long as the same coding in the column names is used for min
, max
and sum
. The order of the input columns is arbitrary and will be maintained on output. For example, if the column order were pregnant_min
, pregnant_max
, glucose_min
, glucose_max
, etc. it would still work and return the columns in that order.
No packages are used.
g <- with(stack(setNames(L, seq_along(L))), ind[order(values)])
fn <- sub(".*_", "", colnames(M))
fn[grepl("Num", colnames(M))] <- "sum"
mapply(function(col, fn) tapply(col, g, fn), as.list(as.data.frame(M)), fn)
giving:
pregnant_min glucose_min blood_min skin_min INSULIN_min MASS_min DIAB_min
1 0.0 0 0 0 0 0 0
2 0.2 0 0 0 0 0 0
3 0.4 0 0 0 0 0 0
AGE_min CLASS_min pregnant_max glucose_max blood_max skin_max INSULIN_max
1 0 0 0.2 1 1 1 1
2 0 0 0.4 1 1 1 1
3 0 0 1.0 1 1 1 1
MASS_max DIAB_max AGE_max CLASS_max NumOfObser
1 1 1 1 1 423
2 1 1 1 1 176
3 1 1 1 1 169
Upvotes: 1
Reputation: 389135
library(matrixStats)
#Get index of "min" cols
min_col <- grep("min", colnames(M))
#Get index of "max" cols
max_col <- grep("max", colnames(M))
setNames(do.call("rbind.data.frame", lapply(L, function(x) {
if (length(x) > 1)
c(colMins(M[x, min_col]), colMaxs(M[x, max_col]), sum(M[x, "NumOfObser"]))
else
M[x, ]
})), colnames(M))
# pregnant_min glucose_min blood_min skin_min INSULIN_min MASS_min DIAB_min AGE_min
#1 0.0 0 0 0 0 0 0 0
#2 0.2 0 0 0 0 0 0 0
#3 0.4 0 0 0 0 0 0 0
# CLASS_min pregnant_max glucose_max blood_max skin_max INSULIN_max MASS_max DIAB_max
#1 0 0.2 1 1 1 1 1 1
#2 0 0.4 1 1 1 1 1 1
#3 0 1.0 1 1 1 1 1 1
# AGE_max CLASS_max NumOfObser
#1 1 1 423
#2 1 1 176
#3 1 1 169
We first find out the index of "max" and "min" cols and store them in separate vector. For every list element in L
we check it's length
and if it is 1 then we return the row as it is since max
and min
on 1 row matrix would give us the same row. If the length
is greater than 1 then we take minimum of each min_col
and maximum in each max_col
take sum
of "NumOfObser" column and return one row for that group. Finally we rbind
all these rows and giving it's original meaningful names using setNames
.
I have used colMins
and colMaxs
functions because it is easy to understand and makes the operation simple. If anybody is interested only in base R answer they can use sapply
to get column-wise max
and min
setNames(do.call("rbind.data.frame", lapply(L, function(x) {
if (length(x) > 1)
c(sapply(data.frame(M[x, min_col]), min),
sapply(data.frame(M[x, max_col]), max),
sum(M[x, "NumOfObser"]))
else
M[x, ]
})), colnames(M))
Upvotes: 3