Chase Grimm
Chase Grimm

Reputation: 425

Aggregate data by multiple groups

My questions is best described with an example:

data:

v1 <- c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p")
v2 <- c("aa","bb","cc","dd","ee","ff","gg","hh","ii","jj","kk","ll","mm","nn","oo","pp")
v3 <- c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","jjj","kkk","lll","mmm","nnn","ooo","ppp")
values <- 1:10000
y <- data.frame(var1=sample(v1,size = 100,replace=T),
                var2=sample(v2,size=100,replace=T),
                var3=sample(v3,size=100,replace=T),
                val1=sample(values,size=100,replace=T),
                val2=sample(values,size=100,replace=T))


head(y)
  var1 var2 var3 val1 val2
1    h   hh  lll 4832 9547
2    h   nn  eee  727 6382
3    h   bb  jjj 9985 8041
4    e   bb  jjj 1438  280
5    k   ff  nnn 3094 4316
6    f   cc  kkk 7426 3588

Basically I need to sum val1 and val2 by all unique combinations of var1,var2, and var3. So, none of the rows in the head statement above would get summed together.

In the actual data I'm working with, there are thousands of different values for var1, var2, and var3. Additionally, there are a varying number of columns I want to group by. This is going to be used in a shiny app where the user selects any number of groups that he/she wants to aggregate by. I've tried using stats::aggregate but it doesn't seem to be versatile enough for this.

If you need anything else to help answer my question let me know.

EDIT: Very helpful responses so far and thank you for that, but they aren't quite what I'm looking for (my fault, I didn't state the question clear enough).

The issue is automating my script so it can aggregate the example above as well as the following:

head(y)
  var1 var2 var3 var4 var5 val1 val2 val3
1    p   pp  nnn   ll  aaa 3914 6621 8968
2    d   ii  jjj   pp  aaa 5885  955 7560
3    j   bb  ppp   aa  ccc 7979 5602 4639
4    e   mm  ooo   mm  fff 8746 8612 2212
5    g   ee  aaa   kk  lll 4795 6438 6318
6    f   mm  kkk   ii  ggg 7550   90 1173

In my shiny app the data could just as easily look like this, or some other variation with any number of columns. I know the index values for the variable and value columns, but not their names. So I can use the index values, but they are changing. Also I can't really hard code in any columns since they are always changing.

I hope this helps!

Upvotes: 0

Views: 856

Answers (1)

Bryan Goggin
Bryan Goggin

Reputation: 2489

library(data.table)
y<-data.table(y)
setkey(y, var1, var2, var3)

y[,.(sum1 = sum(val1), sum2 = sum(val2)), by = c("var1", "var2", "var3") ]

var1 var2 var3  sum1 sum2
a   cc  ggg  4483 5176
a   gg  ccc  1928 1833
a   mm  ppp  1550 3930
a   nn  kkk  1340 2080
b   dd  ccc  8950  158
b   hh  bbb  4798  689
b   kk  mmm   129 9740
b   pp  ccc  8363 1531
c   gg  ggg  1662  125
c   gg  hhh  4485 3800
...

The other popular option is dplyr:

library(dplyr)

y %>% group_by(var1, var2, var3) %>% 
  summarise(sum1 = sum(val1), sum2 =sum(val2))

 var1   var2   var3  sum1  sum2
(fctr) (fctr) (fctr) (int) (int)
    a     cc    ggg  4483  5176
    a     gg    ccc  1928  1833
    a     mm    ppp  1550  3930
    a     nn    kkk  1340  2080
    b     dd    ccc  8950   158
    b     hh    bbb  4798   689
    b     kk    mmm   129  9740
    b     pp    ccc  8363  1531
    c     gg    ggg  1662   125
    c     gg    hhh  4485  3800

Some say the syntax of dplyr is a little easier to read/write, but I prefer both methods equally. Speed is usually comparable, though I believe data.table has the edge for very large data sets.

Upvotes: 3

Related Questions