newbie
newbie

Reputation: 907

Find mean difference by group on variables using data.table

Supposed, I have the following data.table

library(data.table)
dt <- data.table(x1 = c(1:12), x2=c(21:32))

Then, I create bins by user-specified intervals using the following:

dt[,intx1:=cut(x1, breaks = c(-Inf, 4, 9, Inf))]

return,

    x1 x2    intx1
 1:  1 21 (-Inf,4]
 2:  2 22 (-Inf,4]
 3:  3 23 (-Inf,4]
 4:  4 24 (-Inf,4]
 5:  5 25    (4,9]
 6:  6 26    (4,9]
 7:  7 27    (4,9]
 8:  8 28    (4,9]
 9:  9 29    (4,9]
10: 10 30 (9, Inf]
11: 11 31 (9, Inf]
12: 12 32 (9, Inf]

I am trying to find the mean difference between bins and variable:

dt[, mux1_grp:=mean(x1), by = intx1][,mux1_pop:=mean(x1)][,mux1_diff:=mux1_grp-mux1_pop]
dt[,`:=`(intx1=NULL, mux1_grp=NULL, mux1_pop=NULL)]

The return is:

    x1 x2 mux1_diff
 1:  1 21      -4.0
 2:  2 22      -4.0
 3:  3 23      -4.0
 4:  4 24      -4.0
 5:  5 25       0.5
 6:  6 26       0.5
 7:  7 27       0.5
 8:  8 28       0.5
 9:  9 29       0.5
10: 10 30       4.5
11: 11 31       4.5
12: 12 32       4.5

However, my original data contain several variables (e.g., x1, x2,...,x20).
So, I have to repeat the same procedure for x2 as follows:

dt[,intx2:=cut(x2, breaks = c(-Inf, 25, 28, Inf))]
dt[, mux2_grp:=mean(x2), by = intx2][,mux2_pop:=mean(x2)][,mux2_diff:=mux2_grp-mux2_pop]
dt[,`:=`(intx2=NULL, mux2_grp=NULL, mux2_pop=NULL)]

My final output will be:

    x1 x2 mux1_diff mux2_diff
 1:  1 21      -4.0      -3.5
 2:  2 22      -4.0      -3.5
 3:  3 23      -4.0      -3.5
 4:  4 24      -4.0      -3.5
 5:  5 25       0.5      -3.5
 6:  6 26       0.5       0.5
 7:  7 27       0.5       0.5
 8:  8 28       0.5       0.5
 9:  9 29       0.5       4.0
10: 10 30       4.5       4.0
11: 11 31       4.5       4.0
12: 12 32       4.5       4.0

How can I improve this code? Note that each variable has different user-specified intervals

Upvotes: 3

Views: 447

Answers (1)

akrun
akrun

Reputation: 887118

We can do this in a compact one-step option (though it may be not that optimal when compared to the OP's method (from @Frank's comments)

dt[, mu_diff := mean(x) - mean(dt$x), by = .(cut(x, breaks = c(-Inf, 4, 9, Inf)))][]
#    x    mu_diff
#1:  1 -3.8636364
#2:  2 -3.8636364
#3:  3 -3.8636364
#4:  4 -3.8636364
#5:  5  0.3863636
#6:  6  0.3863636
#7:  7  0.3863636
#8:  9  0.3863636
#9: 10  4.6363636
#10:11  4.6363636
#11:12  4.6363636

If there are many variables (it is not clear whether we use the same breaks in cut or not for different columns - assuming that it is the same), we can loop through the columns (in the reproducible example below, two variables 'x1' and 'x2' are showed), specify the .SDcols by the numeric index of the column, grouped by the cut of the subsetted column, we assign the new column as the difference between the mean of the values within the group and the mean of the entire column.

nm1 <- paste0("mu_diff", seq_along(dt1))
for(j in seq_along(dt1)){
  dt1[, (nm1[j]) := mean(.SD[[1L]]) - mean(dt1[[names(dt1)[j]]]), 
      by = .(cut(get(names(dt1)[j]), breaks = c(-Inf, 4, 9, Inf))) ,
              .SDcols = j][]
 }

Update - Suppose if the breaks for cut variable is different for each column, place it in a list and get that list element inside the for loop with the indexing.

brkLst <- list(c(-Inf, 4, 9, Inf), c(-Inf, 10, 14, Inf))
for(j in seq_along(dt1)){
  dt1[, (nm1[j]) := mean(.SD[[1L]]) - mean(dt1[[names(dt1)[j]]]), 
      by = .(cut(get(names(dt1)[j]), breaks = brkLst[[j]])) ,
              .SDcols = j][]
 }

Checking the output with OP's new data ('dt2')

brkLst2 <- list(c(-Inf, 4, 9, Inf),  c(-Inf, 25, 28, Inf))
nm1 <- paste0("mu", names(dt2), "_diff")
for(j in seq_along(dt2)){
   dt2[, (nm1[j]) := mean(.SD[[1L]]) - mean(dt2[[names(dt2)[j]]]), 
  by = .(cut(get(names(dt2)[j]), breaks = brkLst2[[j]])) ,
          .SDcols = j][]
}

dt2
#    x1 x2 mux1_diff mux2_diff
# 1:  1 21      -4.0      -3.5
# 2:  2 22      -4.0      -3.5
# 3:  3 23      -4.0      -3.5
# 4:  4 24      -4.0      -3.5
# 5:  5 25       0.5      -3.5
# 6:  6 26       0.5       0.5
# 7:  7 27       0.5       0.5
# 8:  8 28       0.5       0.5
# 9:  9 29       0.5       4.0
#10: 10 30       4.5       4.0
#11: 11 31       4.5       4.0
#12: 12 32       4.5       4.0

data

dt1 <- data.table(x1 = c(1,2,3,4,5,6,7,9,10,11,12))[, x2 := x1 + 5][]
#OP's changed dataset
dt2 <- data.table(x1 = 1:12, x2=21:32)

Upvotes: 2

Related Questions