Reputation: 425
My questions is best described with an example:
data:
v1 <- c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p")
v2 <- c("aa","bb","cc","dd","ee","ff","gg","hh","ii","jj","kk","ll","mm","nn","oo","pp")
v3 <- c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","jjj","kkk","lll","mmm","nnn","ooo","ppp")
values <- 1:10000
y <- data.frame(var1=sample(v1,size = 100,replace=T),
var2=sample(v2,size=100,replace=T),
var3=sample(v3,size=100,replace=T),
val1=sample(values,size=100,replace=T),
val2=sample(values,size=100,replace=T))
head(y)
var1 var2 var3 val1 val2
1 h hh lll 4832 9547
2 h nn eee 727 6382
3 h bb jjj 9985 8041
4 e bb jjj 1438 280
5 k ff nnn 3094 4316
6 f cc kkk 7426 3588
Basically I need to sum val1 and val2 by all unique combinations of var1,var2, and var3. So, none of the rows in the head
statement above would get summed together.
In the actual data I'm working with, there are thousands of different values for var1, var2, and var3. Additionally, there are a varying number of columns I want to group by. This is going to be used in a shiny app where the user selects any number of groups that he/she wants to aggregate by. I've tried using stats::aggregate
but it doesn't seem to be versatile enough for this.
If you need anything else to help answer my question let me know.
EDIT: Very helpful responses so far and thank you for that, but they aren't quite what I'm looking for (my fault, I didn't state the question clear enough).
The issue is automating my script so it can aggregate the example above as well as the following:
head(y)
var1 var2 var3 var4 var5 val1 val2 val3
1 p pp nnn ll aaa 3914 6621 8968
2 d ii jjj pp aaa 5885 955 7560
3 j bb ppp aa ccc 7979 5602 4639
4 e mm ooo mm fff 8746 8612 2212
5 g ee aaa kk lll 4795 6438 6318
6 f mm kkk ii ggg 7550 90 1173
In my shiny app the data could just as easily look like this, or some other variation with any number of columns. I know the index values for the variable and value columns, but not their names. So I can use the index values, but they are changing. Also I can't really hard code in any columns since they are always changing.
I hope this helps!
Upvotes: 0
Views: 856
Reputation: 2489
library(data.table)
y<-data.table(y)
setkey(y, var1, var2, var3)
y[,.(sum1 = sum(val1), sum2 = sum(val2)), by = c("var1", "var2", "var3") ]
var1 var2 var3 sum1 sum2
a cc ggg 4483 5176
a gg ccc 1928 1833
a mm ppp 1550 3930
a nn kkk 1340 2080
b dd ccc 8950 158
b hh bbb 4798 689
b kk mmm 129 9740
b pp ccc 8363 1531
c gg ggg 1662 125
c gg hhh 4485 3800
...
The other popular option is dplyr
:
library(dplyr)
y %>% group_by(var1, var2, var3) %>%
summarise(sum1 = sum(val1), sum2 =sum(val2))
var1 var2 var3 sum1 sum2
(fctr) (fctr) (fctr) (int) (int)
a cc ggg 4483 5176
a gg ccc 1928 1833
a mm ppp 1550 3930
a nn kkk 1340 2080
b dd ccc 8950 158
b hh bbb 4798 689
b kk mmm 129 9740
b pp ccc 8363 1531
c gg ggg 1662 125
c gg hhh 4485 3800
Some say the syntax of dplyr
is a little easier to read/write, but I prefer both methods equally. Speed is usually comparable, though I believe data.table
has the edge for very large data sets.
Upvotes: 3