Aggregate by Variables and Levels in R

Question

I have the following data:

test <- data.frame(Var1 = factor(c("A", "B", "A", "A", "B")),
                   Var2 = factor(c("I", "II", "II", "I", "III")),
                   x = c(1,1,2,3,4)
                   )

I would like to get the sum of x for every level of each variable separately. Aggregate does not do what I need:

>aggregate(test$x, by=list("Var1"= test$Var1,"Var2" = test$Var2), sum)
  Var1 Var2 x
1    A    I 4
2    A   II 2
3    B   II 1
4    B  III 4

I would like it to be the following:

Var1-A 6
Var1-B 5
Var2-I 4
Var2-II 3
Var2-III 4

In other words, it is a loop over each variable, and then each variable's levels.

How can I do this without loops? I figure a reshape is in order, but I'm not sure how to do it.

Uwe · Accepted Answer

The OP wants to compute the aggregate for each variable separately but wants to combine the separate results in one data frame. In SQL the latter step would correspond to a UNION operation.

With data.table, this can be achieved by rbindlist():

library(data.table)
setDT(test)
cols <- c("Var1", "Var2")
rbindlist(
  lapply(cols, function(.col) test[, sum(x), by = .col]), idcol = TRUE
)[, setnames(.SD, c("var", "lvl", "sum_x"))][, var := factor(var, labels = cols)][]
#    var lvl sum_x
#1: Var1   A     6
#2: Var1   B     5
#3: Var2   I     4
#4: Var2  II     3
#5: Var2 III     4

Explanation

Store the variable names to loop over in cols.
Loop over the variables using lapply() and for each variable compute the aggregates grouped by the levels within the variable. The result is a list of data.tables.
The list is combined row-wise using rbindlist. The id column gives the number of the list element the rows are taken from.
Rename the columns.
Convert the number in the id column to the names of the variables.

Note, we could have named the list elements with the variable names by setNames() before calling rbindlist() but this would come with a perfomance penalty due to copying the result. Instead, this is done in the last step where only factor labels have to be assigned to the numbers in the id column.

Of course, there is an alternative approach (similar to this answer) which first reshapes the data from wide to long format and then does the aggregates grouped by variable and level.

melt(setDT(test), "x", value.name = "level")[, .(sum_x = sum(x)), by = .(variable, level)]
#   variable level sum_x
#1:     Var1     A     6
#2:     Var1     B     5
#3:     Var2     I     4
#4:     Var2    II     3
#5:     Var2   III     4

data.tablereports no issues with factor levels.

It would be interesting to see which of the two approaches is more efficient in terms of memory consumption and speed for large problems.

Aggregate by Variables and Levels in R

Answers (2)

Related Questions