Reputation: 2505
I have the following data:
test <- data.frame(Var1 = factor(c("A", "B", "A", "A", "B")),
Var2 = factor(c("I", "II", "II", "I", "III")),
x = c(1,1,2,3,4)
)
I would like to get the sum of x
for every level of each variable separately. Aggregate does not do what I need:
>aggregate(test$x, by=list("Var1"= test$Var1,"Var2" = test$Var2), sum)
Var1 Var2 x
1 A I 4
2 A II 2
3 B II 1
4 B III 4
I would like it to be the following:
Var1-A 6
Var1-B 5
Var2-I 4
Var2-II 3
Var2-III 4
In other words, it is a loop over each variable, and then each variable's levels.
How can I do this without loops? I figure a reshape
is in order, but I'm not sure how to do it.
Upvotes: 0
Views: 458
Reputation: 42544
The OP wants to compute the aggregate for each variable separately but wants to combine the separate results in one data frame. In SQL
the latter step would correspond to a UNION
operation.
With data.table
, this can be achieved by rbindlist()
:
library(data.table)
setDT(test)
cols <- c("Var1", "Var2")
rbindlist(
lapply(cols, function(.col) test[, sum(x), by = .col]), idcol = TRUE
)[, setnames(.SD, c("var", "lvl", "sum_x"))][, var := factor(var, labels = cols)][]
# var lvl sum_x
#1: Var1 A 6
#2: Var1 B 5
#3: Var2 I 4
#4: Var2 II 3
#5: Var2 III 4
Explanation
cols
.lapply()
and for each variable compute the aggregates grouped by the levels within the variable. The result is a list of data.tables.rbindlist
. The id column gives the number of the list element the rows are taken from.Note, we could have named the list elements with the variable names by setNames()
before calling rbindlist()
but this would come with a perfomance penalty due to copying the result. Instead, this is done in the last step where only factor labels have to be assigned to the numbers in the id column.
Of course, there is an alternative approach (similar to this answer) which first reshapes the data from wide to long format and then does the aggregates grouped by variable and level.
melt(setDT(test), "x", value.name = "level")[, .(sum_x = sum(x)), by = .(variable, level)]
# variable level sum_x
#1: Var1 A 6
#2: Var1 B 5
#3: Var2 I 4
#4: Var2 II 3
#5: Var2 III 4
data.table
reports no issues with factor levels.
It would be interesting to see which of the two approaches is more efficient in terms of memory consumption and speed for large problems.
Upvotes: 1
Reputation: 33772
This works (with a warning message because factors with different levels are being gathered into a single column):
test %>%
gather(level1, level2, -x) %>%
group_by(level1, level2) %>%
summarise(sum(x))
Result:
level1 level2 `sum(x)`
<chr> <chr> <dbl>
1 Var1 A 6
2 Var1 B 5
3 Var2 I 4
4 Var2 II 3
5 Var2 III 4
There are fixes for the factor levels issue in this answer.
Upvotes: 1