Connor J
Connor J

Reputation: 219

How to create multiple new columns with data.table's j over different subsets

I want to create multiple variables that are aggregating various subsets of a dataset. For an illustrating example, say you have the following data:

DT = data.table(Group1 = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4), 
                Group2 = c(1,1,1,2,2,1,1,2,2,2,1,1,1,1,2,1,1,2,2,2), 
                  Var1 = c(1,1,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0))

I want to find several averages of variable Var1. I want to know:

Or, in data.table parlance,

DT[, mean(Var1), by=Group1]
DT[Group2==1, mean(Var1), by=Group1]
DT[Group2==2, mean(Var1), by=Group1]

Obviously, calculating any one of these is very straightforward. But I can't find a good way to calculate all three of them, since they use different subsets in i. The solution I've been using so far is generating them individually, then merging them into a unified table.

DT_all <- DT[, .(avgVar1_all = mean(Var1)), by = Group1]
DT_1 <- DT[Group2 == 1, .(avgVar1_1 = mean(Var1)), by = Group1]
DT_2 <- DT[Group2 == 2, .(avgVar1_2 = mean(Var1)), by = Group1]
group_info <- merge(DT_all, DT_1, by = "Group1")
group_info <- merge(group_info, DT_2, by = "Group1")

group_info
#    Group1 avgVar1_all avgVar1_1 avgVar1_2
# 1:      1         0.4 0.6666667 0.0000000
# 2:      2         0.6 1.0000000 0.3333333
# 3:      3         0.2 0.2500000 0.0000000
# 4:      4         0.0 0.0000000 0.0000000

Is there a more elegant method I could be using?

Upvotes: 2

Views: 79

Answers (2)

Frank
Frank

Reputation: 66819

You can use reshape2::dcast:

reshape2::dcast(DT, Group1 ~ Group2, fun=mean, margins="Group2")


  Group1         1         2 (all)
1      1 0.6666667 0.0000000   0.4
2      2 1.0000000 0.3333333   0.6
3      3 0.2500000 0.0000000   0.2
4      4 0.0000000 0.0000000   0.0

@thelatmail noted in a comment below that this approach does not scale well. Eventually, margins should be available in data.table's dcast, which will probably be more efficient.

An ugly workaround:

DT[, c(
  dcast(.SD, Group1 ~ Group2, fun=mean), 
  all = .(dcast(.SD, Group1 ~ ., fun=mean)$.)
)]


   Group1         1         2 all
1:      1 0.6666667 0.0000000 0.4
2:      2 1.0000000 0.3333333 0.6
3:      3 0.2500000 0.0000000 0.2
4:      4 0.0000000 0.0000000 0.0

Upvotes: 3

thelatemail
thelatemail

Reputation: 93908

Just do it all in one grouping operation using .SD:

DT[, .(
        all  = mean(Var1),
        grp1 = .SD[Group2==1, mean(Var1)],
        grp2 = .SD[Group2==2, mean(Var1)]
      ),
  by = Group1,
  .SDcols=c("Group2","Var1")
  ]

#   Group1 all      grp1      grp2
#1:      1 0.4 0.6666667 0.0000000
#2:      2 0.6 1.0000000 0.3333333
#3:      3 0.2 0.2500000 0.0000000
#4:      4 0.0 0.0000000 0.0000000

Upvotes: 5

Related Questions