Reputation: 219
I want to create multiple variables that are aggregating various subsets of a dataset. For an illustrating example, say you have the following data:
DT = data.table(Group1 = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
Group2 = c(1,1,1,2,2,1,1,2,2,2,1,1,1,1,2,1,1,2,2,2),
Var1 = c(1,1,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0))
I want to find several averages of variable Var1
. I want to know:
mean(Var1)
grouped by Group1
mean(Var1)
for only those with Group2 == 1
, grouped by Group1
mean(Var1)
for only those with Group2 == 2
, grouped by Group1
Or, in data.table parlance,
DT[, mean(Var1), by=Group1]
DT[Group2==1, mean(Var1), by=Group1]
DT[Group2==2, mean(Var1), by=Group1]
Obviously, calculating any one of these is very straightforward. But I can't find a good way to calculate all three of them, since they use different subsets in i
. The solution I've been using so far is generating them individually, then merging them into a unified table.
DT_all <- DT[, .(avgVar1_all = mean(Var1)), by = Group1]
DT_1 <- DT[Group2 == 1, .(avgVar1_1 = mean(Var1)), by = Group1]
DT_2 <- DT[Group2 == 2, .(avgVar1_2 = mean(Var1)), by = Group1]
group_info <- merge(DT_all, DT_1, by = "Group1")
group_info <- merge(group_info, DT_2, by = "Group1")
group_info
# Group1 avgVar1_all avgVar1_1 avgVar1_2
# 1: 1 0.4 0.6666667 0.0000000
# 2: 2 0.6 1.0000000 0.3333333
# 3: 3 0.2 0.2500000 0.0000000
# 4: 4 0.0 0.0000000 0.0000000
Is there a more elegant method I could be using?
Upvotes: 2
Views: 79
Reputation: 66819
You can use reshape2::dcast
:
reshape2::dcast(DT, Group1 ~ Group2, fun=mean, margins="Group2")
Group1 1 2 (all)
1 1 0.6666667 0.0000000 0.4
2 2 1.0000000 0.3333333 0.6
3 3 0.2500000 0.0000000 0.2
4 4 0.0000000 0.0000000 0.0
@thelatmail noted in a comment below that this approach does not scale well. Eventually, margins
should be available in data.table's dcast
, which will probably be more efficient.
An ugly workaround:
DT[, c(
dcast(.SD, Group1 ~ Group2, fun=mean),
all = .(dcast(.SD, Group1 ~ ., fun=mean)$.)
)]
Group1 1 2 all
1: 1 0.6666667 0.0000000 0.4
2: 2 1.0000000 0.3333333 0.6
3: 3 0.2500000 0.0000000 0.2
4: 4 0.0000000 0.0000000 0.0
Upvotes: 3
Reputation: 93908
Just do it all in one grouping operation using .SD
:
DT[, .(
all = mean(Var1),
grp1 = .SD[Group2==1, mean(Var1)],
grp2 = .SD[Group2==2, mean(Var1)]
),
by = Group1,
.SDcols=c("Group2","Var1")
]
# Group1 all grp1 grp2
#1: 1 0.4 0.6666667 0.0000000
#2: 2 0.6 1.0000000 0.3333333
#3: 3 0.2 0.2500000 0.0000000
#4: 4 0.0 0.0000000 0.0000000
Upvotes: 5