panman
panman

Reputation: 1341

R: data.table compute weighted means of multiple variables with multiple weight variables each, by group

I am still new to data.table. My question is similar to this one and this one. The difference is that I want to compute weighted means for multiple variables, by groups, but using more than one weight for each mean.

Consider the following data.table (the actual is much larger):

library(data.table)

set.seed(123456)

mydata <- data.table(CLID = rep("CNK", 10),
                     ITNUM = rep(c("First", "Second", "First", "First", "Second"), 2),
                     SATS = rep(c("Always", "Amost always", "Sometimes", "Never", "Always"), 2),
                     ASSETS = rep(c("0-10", "11-25", "26-100", "101-200", "MORE THAN 200"), 2),
                     AVGVALUE1 = rnorm(10, 10, 2),
                     AVGVALUE2 = rnorm(10, 10, 2),
                     WGT1 = rnorm(10, 3, 1),
                     WGT2 = rnorm(10, 3, 1),
                     WGT3 = rnorm(10, 3, 1))

#I set the key of the table to the variables I want to group by,
#so the output is sorted
setkeyv(mydata, c("CLID", "ITNUM", "SATS", "ASSETS"))

What I want to achieve is to compute the weighted means for AVGVALUE1 and AVGVALUE2 (and possibly more variables) by groups defined by ITNUM, SATS, ASSETS using each one of the weight variables WGT1, WGT2, WGT3 (and possibly more of them). So for each of the variables I want to compute the weighted means I will have three weighted means by groups (or whatever the number of weights is).

I can do it for each variable separately, e.g.:

all.weights <- c("WGT1", "WGT2", "WGT3")
avg.var <- "AVGVALUE1"
split.vars <- c("ITNUM", "SATS", "ASSETS")

mydata[ , Map(f = weighted.mean, x = .(get(avg.var)), w = mget(all.weights),
na.rm = TRUE), by = c(key(mydata)[1], split.vars)]

I add the first key variable in by, although it is a constant, because I would like to have it as a column in the output. And I get:

   CLID  ITNUM         SATS        ASSETS       V1       V2       V3
1:  CNK  First       Always          0-10 11.66824 11.66819 11.66829
2:  CNK  First        Never       101-200 11.37378 12.21008 11.60182
3:  CNK  First    Sometimes        26-100 12.43004 13.13450 12.01330
4:  CNK Second       Always MORE THAN 200 12.32265 11.81613 12.56786
5:  CNK Second Amost always         11-25 10.76556 11.34669 10.52458

However, with the actual data.table, where I have much more columns to compute the weighted means for (as well as much more weights to use), it would be rather cumberosme to do it one by one. What I imagine is a function where the mean for each of the variables (AVGVALUE1, AVGVALUE2 and so on) is computed with each of the weight variables (WGT1, WGT2, WGT3 and so on) and the output for each variable for which the weighted mean is computed is added to a list. I guess the list would be the best option because if all estimates are in the same output, the number of columns might be endless. So something like this:

[[1]]
   CLID  ITNUM         SATS        ASSETS       V1       V2       V3
1:  CNK  First       Always          0-10 11.66824 11.66819 11.66829
2:  CNK  First        Never       101-200 11.37378 12.21008 11.60182
3:  CNK  First    Sometimes        26-100 12.43004 13.13450 12.01330
4:  CNK Second       Always MORE THAN 200 12.32265 11.81613 12.56786
5:  CNK Second Amost always         11-25 10.76556 11.34669 10.52458

[[2]]
   CLID  ITNUM         SATS        ASSETS        V1        V2        V3
1:  CNK  First       Always          0-10  9.132899  9.060045  9.197005
2:  CNK  First        Never       101-200 12.896584 13.278680 13.000772
3:  CNK  First    Sometimes        26-100 10.972260 11.215390 10.828431
4:  CNK Second       Always MORE THAN 200 11.704404 11.611072 11.749586
5:  CNK Second Amost always         11-25  8.086409  8.225030  8.028928

What I tried so far:

  1. Using lapply

    all.weights <- c("WGT1", "WGT2", "WGT3")
    avg.vars <- c("AVGVALUE1", "AVGVALUE2")
    split.vars <- c("ITNUM", "SATS", "ASSETS")
    
    lapply(mydata, function(i) {
    mydata[ , Map(f = weighted.mean, x = mget(avg.vars)[i], w = mget(all.weights),
    na.rm = TRUE), by = c(key(mydata)[1], split.vars)]
    })
    
    Error in weighted.mean.default(x = dots[[1L]][[1L]], w = dots[[2L]][[1L]],  : 
     'x' and 'w' must have the same length
    
  2. Using mapply

    myfun <- function(data, spl.v, avg.v, wgts) {
      data[ , Map(f = weighted.mean, x = mget(avg.v), w = mget(all.weights),
      na.rm = TRUE), by = c(key(data)[1], spl.v)]
    }
    
    mapply(FUN = myfun, data = mydata, spl.v = split.vars, avg.v = avg.vars,
    wgts = all.weights)
    
    Error: value for ‘AVGVALUE2’ not found
    

I tried to wrap the mget(avg.v) as a list - .(mget(avg.v)), but then get another error:

 Error in mapply(FUN = f, ..., SIMPLIFY = FALSE) : 
  could not find function "." 

Can someone help?

Upvotes: 3

Views: 2071

Answers (2)

dww
dww

Reputation: 31454

We can use outer (which performs a function on all combinations of the values in two input vectors) operating on a vectorized weighted means function. By defining the function used by outer within the scope of the data table, we can have get evaluate to the data.table columns:

wmeans = mydata[, {
  f  = function(X,Y) weighted.mean(get(X), get(Y));
  vf = Vectorize(f);
  outer(avg.var, all.weights, vf)},
  by = split.vars]

This puts all the means into a single column (i.e. 'long' format). We can also add a couple more columns to specify which value/weight combination each refers to:

wmeans[, mean.v := expand.grid(avg.var, all.weights)[,1]]       
wmeans[, mean.w := expand.grid(avg.var, all.weights)[,2]]
head(wmeans)
#    ITNUM   SATS ASSETS        V1    mean.v mean.w
# 1: First Always   0-10 11.668243 AVGVALUE1   WGT1
# 2: First Always   0-10  9.132899 AVGVALUE2   WGT1
# 3: First Always   0-10 11.668192 AVGVALUE1   WGT2
# 4: First Always   0-10  9.060045 AVGVALUE2   WGT2
# 5: First Always   0-10 11.668287 AVGVALUE1   WGT3
# 6: First Always   0-10  9.197005 AVGVALUE2   WGT3

We can use dcast to reshape this into a data.table that is long in avg.var, but wide in all.weights:

wide.wmeans = dcast(wmeans, mean.v+ITNUM+SATS+ASSETS ~ mean.w, value.var = "V1")  
#       mean.v  ITNUM         SATS        ASSETS      WGT1      WGT2      WGT3
# 1: AVGVALUE1  First       Always          0-10 11.668243 11.668192 11.668287
# 2: AVGVALUE1  First        Never       101-200 11.373780 12.210083 11.601819
# 3: AVGVALUE1  First    Sometimes        26-100 12.430039 13.134499 12.013299
# 4: AVGVALUE1 Second       Always MORE THAN 200 12.322651 11.816135 12.567860
# 5: AVGVALUE1 Second Amost always         11-25 10.765557 11.346688 10.524583
# 6: AVGVALUE2  First       Always          0-10  9.132899  9.060045  9.197005
# 7: AVGVALUE2  First        Never       101-200 12.896584 13.278680 13.000772
# 8: AVGVALUE2  First    Sometimes        26-100 10.972260 11.215390 10.828431
# 9: AVGVALUE2 Second       Always MORE THAN 200 11.704404 11.611072 11.749586
#10: AVGVALUE2 Second Amost always         11-25  8.086409  8.225030  8.028928

If you need this as a list rather than a data.table, you can split it up using

lapply(avg.var, function(x) wide.wmeans[mean.v == x])
# [[1]]
#       mean.v  ITNUM         SATS        ASSETS     WGT1     WGT2     WGT3
# 1: AVGVALUE1  First       Always          0-10 11.66824 11.66819 11.66829
# 2: AVGVALUE1  First        Never       101-200 11.37378 12.21008 11.60182
# 3: AVGVALUE1  First    Sometimes        26-100 12.43004 13.13450 12.01330
# 4: AVGVALUE1 Second       Always MORE THAN 200 12.32265 11.81613 12.56786
# 5: AVGVALUE1 Second Amost always         11-25 10.76556 11.34669 10.52458
# 
# [[2]]
#       mean.v  ITNUM         SATS        ASSETS      WGT1      WGT2      WGT3
# 1: AVGVALUE2  First       Always          0-10  9.132899  9.060045  9.197005
# 2: AVGVALUE2  First        Never       101-200 12.896584 13.278680 13.000772
# 3: AVGVALUE2  First    Sometimes        26-100 10.972260 11.215390 10.828431
# 4: AVGVALUE2 Second       Always MORE THAN 200 11.704404 11.611072 11.749586
# 5: AVGVALUE2 Second Amost always         11-25  8.086409  8.225030  8.028928

Upvotes: 2

Hack-R
Hack-R

Reputation: 23231

I. lapply solution

all.weights <- c("WGT1", "WGT2", "WGT3")
avg.vars    <- c("AVGVALUE1", "AVGVALUE2")
split.vars  <- c("ITNUM", "SATS", "ASSETS")

myfun <- function(avg.vars){
  tmp <-
    mydata[ , Map(f = weighted.mean, 
                x = .(get(avg.vars)), 
                w = mget(all.weights),
                na.rm = TRUE), 
          by = c(key(mydata)[1], split.vars)]  

  return(tmp) # totally optional, a habit from using C and Java
}

lapply(avg.vars, myfun)

Up-sides:

  • Makes use of an *apply
  • Avoids loop
  • Way faster than doing it one by one

Down-sides:

  • Returns a list
[[1]]
   CLID  ITNUM         SATS        ASSETS       V1       V2       V3
1:  CNK  First       Always          0-10 11.66824 11.66819 11.66829
2:  CNK  First        Never       101-200 11.37378 12.21008 11.60182
3:  CNK  First    Sometimes        26-100 12.43004 13.13450 12.01330
4:  CNK Second       Always MORE THAN 200 12.32265 11.81613 12.56786
5:  CNK Second Amost always         11-25 10.76556 11.34669 10.52458

[[2]]
   CLID  ITNUM         SATS        ASSETS        V1        V2        V3
1:  CNK  First       Always          0-10  9.132899  9.060045  9.197005
2:  CNK  First        Never       101-200 12.896584 13.278680 13.000772
3:  CNK  First    Sometimes        26-100 10.972260 11.215390 10.828431
4:  CNK Second       Always MORE THAN 200 11.704404 11.611072 11.749586
5:  CNK Second Amost always         11-25  8.086409  8.225030  8.028928

II. for loop solution

Using a simple for loop with the example where avg.vars has 2 values:

all.weights <- c("WGT1", "WGT2", "WGT3")
avg.vars    <- c("AVGVALUE1", "AVGVALUE2")
split.vars  <- c("ITNUM", "SATS", "ASSETS")

result <- data.frame(matrix(nrow=0,ncol=7))
for(i in avg.vars){
  tmp <- 
    mydata[ , Map(f = weighted.mean, 
                x = .(get(i)), 
                w = mget(all.weights),
                na.rm = TRUE), 
          by = c(key(mydata)[1], split.vars)]  

  result <- rbind(result,tmp,use.names=F)
}
colnames(result) <- c("CLID", "ITNUM", "SATS", "ASSETS", "V1", "V2", "V3")
result
    CLID  ITNUM         SATS        ASSETS        V1        V2        V3
 1:  CNK  First       Always          0-10 11.668243 11.668192 11.668287
 2:  CNK  First        Never       101-200 11.373780 12.210083 11.601819
 3:  CNK  First    Sometimes        26-100 12.430039 13.134499 12.013299
 4:  CNK Second       Always MORE THAN 200 12.322651 11.816135 12.567860
 5:  CNK Second Amost always         11-25 10.765557 11.346688 10.524583
 6:  CNK  First       Always          0-10  9.132899  9.060045  9.197005
 7:  CNK  First        Never       101-200 12.896584 13.278680 13.000772
 8:  CNK  First    Sometimes        26-100 10.972260 11.215390 10.828431
 9:  CNK Second       Always MORE THAN 200 11.704404 11.611072 11.749586
10:  CNK Second Amost always         11-25  8.086409  8.225030  8.028928

Up-sides:

  • Completes instantly in the example
  • Scales to any number of columns with no additional data manipulation/coding
  • Will save a huge amount of time over going one by one
  • Returns a nice data.table
  • if you actually want a list you can get that by initializing return as a list (return <- list()), creating a counter variable (n <- 1) then replacing the rbind statement with return[n] <- tmp and incrementing the counter (n <- n + 1) within the loop

Down-sides:

  • If your data is very large (e.g. > 100,000 rows and dozens or more values of avg.var) then performance of any loop or function written with loops will be poor

Upvotes: 1

Related Questions