Reputation: 907
Supposed, I have the following data.table
library(data.table)
dt <- data.table(x1 = c(1:12), x2=c(21:32))
Then, I create bins by user-specified intervals using the following:
dt[,intx1:=cut(x1, breaks = c(-Inf, 4, 9, Inf))]
return,
x1 x2 intx1
1: 1 21 (-Inf,4]
2: 2 22 (-Inf,4]
3: 3 23 (-Inf,4]
4: 4 24 (-Inf,4]
5: 5 25 (4,9]
6: 6 26 (4,9]
7: 7 27 (4,9]
8: 8 28 (4,9]
9: 9 29 (4,9]
10: 10 30 (9, Inf]
11: 11 31 (9, Inf]
12: 12 32 (9, Inf]
I am trying to find the mean difference between bins and variable:
dt[, mux1_grp:=mean(x1), by = intx1][,mux1_pop:=mean(x1)][,mux1_diff:=mux1_grp-mux1_pop]
dt[,`:=`(intx1=NULL, mux1_grp=NULL, mux1_pop=NULL)]
The return is:
x1 x2 mux1_diff
1: 1 21 -4.0
2: 2 22 -4.0
3: 3 23 -4.0
4: 4 24 -4.0
5: 5 25 0.5
6: 6 26 0.5
7: 7 27 0.5
8: 8 28 0.5
9: 9 29 0.5
10: 10 30 4.5
11: 11 31 4.5
12: 12 32 4.5
However, my original data contain several variables (e.g., x1, x2,...,x20).
So, I have to repeat the same procedure for x2 as follows:
dt[,intx2:=cut(x2, breaks = c(-Inf, 25, 28, Inf))]
dt[, mux2_grp:=mean(x2), by = intx2][,mux2_pop:=mean(x2)][,mux2_diff:=mux2_grp-mux2_pop]
dt[,`:=`(intx2=NULL, mux2_grp=NULL, mux2_pop=NULL)]
My final output will be:
x1 x2 mux1_diff mux2_diff
1: 1 21 -4.0 -3.5
2: 2 22 -4.0 -3.5
3: 3 23 -4.0 -3.5
4: 4 24 -4.0 -3.5
5: 5 25 0.5 -3.5
6: 6 26 0.5 0.5
7: 7 27 0.5 0.5
8: 8 28 0.5 0.5
9: 9 29 0.5 4.0
10: 10 30 4.5 4.0
11: 11 31 4.5 4.0
12: 12 32 4.5 4.0
How can I improve this code? Note that each variable has different user-specified intervals
Upvotes: 3
Views: 447
Reputation: 887118
We can do this in a compact one-step option (though it may be not that optimal when compared to the OP's method (from @Frank's comments)
dt[, mu_diff := mean(x) - mean(dt$x), by = .(cut(x, breaks = c(-Inf, 4, 9, Inf)))][]
# x mu_diff
#1: 1 -3.8636364
#2: 2 -3.8636364
#3: 3 -3.8636364
#4: 4 -3.8636364
#5: 5 0.3863636
#6: 6 0.3863636
#7: 7 0.3863636
#8: 9 0.3863636
#9: 10 4.6363636
#10:11 4.6363636
#11:12 4.6363636
If there are many variables (it is not clear whether we use the same breaks
in cut
or not for different columns - assuming that it is the same), we can loop through the columns (in the reproducible example below, two variables 'x1' and 'x2' are showed), specify the .SDcols
by the numeric index of the column, grouped by the cut
of the subsetted column, we assign the new column as the difference between the mean
of the values within the group and the mean
of the entire column.
nm1 <- paste0("mu_diff", seq_along(dt1))
for(j in seq_along(dt1)){
dt1[, (nm1[j]) := mean(.SD[[1L]]) - mean(dt1[[names(dt1)[j]]]),
by = .(cut(get(names(dt1)[j]), breaks = c(-Inf, 4, 9, Inf))) ,
.SDcols = j][]
}
Update - Suppose if the breaks
for cut
variable is different for each column, place it in a list
and get that list
element inside the for
loop with the indexing.
brkLst <- list(c(-Inf, 4, 9, Inf), c(-Inf, 10, 14, Inf))
for(j in seq_along(dt1)){
dt1[, (nm1[j]) := mean(.SD[[1L]]) - mean(dt1[[names(dt1)[j]]]),
by = .(cut(get(names(dt1)[j]), breaks = brkLst[[j]])) ,
.SDcols = j][]
}
Checking the output with OP's new data ('dt2')
brkLst2 <- list(c(-Inf, 4, 9, Inf), c(-Inf, 25, 28, Inf))
nm1 <- paste0("mu", names(dt2), "_diff")
for(j in seq_along(dt2)){
dt2[, (nm1[j]) := mean(.SD[[1L]]) - mean(dt2[[names(dt2)[j]]]),
by = .(cut(get(names(dt2)[j]), breaks = brkLst2[[j]])) ,
.SDcols = j][]
}
dt2
# x1 x2 mux1_diff mux2_diff
# 1: 1 21 -4.0 -3.5
# 2: 2 22 -4.0 -3.5
# 3: 3 23 -4.0 -3.5
# 4: 4 24 -4.0 -3.5
# 5: 5 25 0.5 -3.5
# 6: 6 26 0.5 0.5
# 7: 7 27 0.5 0.5
# 8: 8 28 0.5 0.5
# 9: 9 29 0.5 4.0
#10: 10 30 4.5 4.0
#11: 11 31 4.5 4.0
#12: 12 32 4.5 4.0
dt1 <- data.table(x1 = c(1,2,3,4,5,6,7,9,10,11,12))[, x2 := x1 + 5][]
#OP's changed dataset
dt2 <- data.table(x1 = 1:12, x2=21:32)
Upvotes: 2