How to iterate through all combinations of columns and apply function by group in R?

Question

I have the following data.table named dt:

  library(data.table)
  set.seed(1)
  dt <- data.table(expand.grid(c("a","b"),1:2,1:2,c("M","N","O","P","Q")))
  dt$perf <- rnorm(nrow(dt),0,.01)
  colnames(dt) <- c("ticker","par1","par2","row_names","perf")

My goal is to iterate through all combinations of par1 and par2 by row_names and pick the one that maximizes cumprod(mean(perf)+1)-1. Let's look at the data so this makes more sense visually.

dt[order(row_names,ticker,par1,par2)]
    ticker par1 par2 row_names         perf
 1:      a    1    1         M  0.011462284
 2:      a    1    2         M -0.004252677
 3:      a    2    1         M  0.005727396
 4:      a    2    2         M -0.003892372
 5:      b    1    1         M -0.024030962
 6:      b    1    2         M  0.009510128
 7:      b    2    1         M  0.003747244
 8:      b    2    2         M -0.002843307

For each ticker and row_names we have 2 x 2 = 4 combinations of par1 and par2, namely, (1,1) (1,2) (2,1) (2,2).

I would like to average the perf for ticker = a, par1 = 1, par2 = 1 with the perf of each par1/par2 combination for ticker = b. Using the numbers from the table above:

res
       a_perf       b_perf
1: 0.01146228 -0.024030962
2: 0.01146228  0.009510128
3: 0.01146228  0.003747244
4: 0.01146228 -0.002843307

apply(res,1,mean)
[1] -0.006284339  0.010486206  0.007604764  0.004309488

Then, we repeat this process for ticker = a, par1 = 1, par2 = 2 with all other combinations for ticker = b.

We would repeat this process for all combinations of par1 and par2 with each row_names.
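The pairing described above can be sketched as a cross join in data.table. This is only an illustration under stated assumptions, not the calcCombMeans helper mentioned below (its code is not shown here). The names a, b, pairs, mean_perf and best are hypothetical.

```r
library(data.table)

# Rebuild the example data from the question
set.seed(1)
dt <- data.table(expand.grid(c("a","b"), 1:2, 1:2, c("M","N","O","P","Q")))
dt[, perf := rnorm(.N, 0, .01)]
setnames(dt, c("ticker","par1","par2","row_names","perf"))

# One table per ticker, with the parameter columns renamed so they survive the join
a <- dt[ticker == "a", .(row_names, a_par1 = par1, a_par2 = par2, a_perf = perf)]
b <- dt[ticker == "b", .(row_names, b_par1 = par1, b_par2 = par2, b_perf = perf)]

# Within each row_names, pair every "a" row with every "b" row (4 x 4 = 16 pairs)
pairs <- merge(a, b, by = "row_names", allow.cartesian = TRUE)
pairs[, mean_perf := (a_perf + b_perf) / 2]

# Keep the whole winning row per row_names, so the parameters travel with the result
best <- pairs[, .SD[which.max(mean_perf)], by = row_names]
print(best)
```

Keeping the full row rather than a positional index means the winning par1/par2 values can be read off directly. The join grows with the product of the combination counts, so it is meant for checking small cases rather than for millions of rows.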

EDIT::: Using @earch's suggestion we get the following:

tmp <- lapply(split(dt, dt$row_names), calcCombMeans)
$M
   a.row b.row          mean
1      1     2 -0.0022140524
2      3     2 -0.0032599264
3      5     2  0.0025657555
4      7     2  0.0033553619
5      1     4  0.0048441350
6      3     4  0.0037982609
7      5     4  0.0096239429
8      7     4  0.0104135493
9      1     6 -0.0072346110
10     3     6 -0.0082804850
11     5     6 -0.0024548031
12     7     6 -0.0016651967
13     1     8  0.0005593545
14     3     8 -0.0004865195
15     5     8  0.0053391624
16     7     8  0.0061287688

From here, I would like to pick the max(mean) for each of the row_names M, N, O, P, Q. If I did not care about referencing the indices later on, one way to do that would be:

res <- sapply(seq_along(tmp),function(i) which.max(tmp[[i]]$mean))
[1]  8  6  3 12 16

This is how I would calculate my desired end result in full:

res <- rbindlist(tmp,idcol="row_names")
res <- res[,list(best=max(mean),best_idx=which.max(mean)),by=row_names]
   row_names        best best_idx
1:         M 0.010413549        8
2:         N 0.009508122        6
3:         O 0.009314068        3
4:         P 0.008883106       12
5:         Q 0.009316006       16

I haven't decided whether I need the best_idx information (I probably will in order to replicate the exact calculation of a specific row_names), but using this res, I can calculate my cumRet by doing:

res[,cumRet:= cumprod(best+1)-1]
> res
   row_names        best best_idx      cumRet
1:         M 0.010413549        8 0.01041355
2:         N 0.009508122        6 0.02002068
3:         O 0.009314068        3 0.02952123
4:         P 0.008883106       12 0.03866657
5:         Q 0.009316006       16 0.04834280
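As a small worked example of the compounding formula used for cumRet, here is the same cumprod(x + 1) - 1 calculation on two made-up returns. The values in r are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Hypothetical per-period returns: +10% followed by -5%
r <- c(0.10, -0.05)

# Compound the growth factors, then subtract 1 to get the cumulative return
cum_ret <- cumprod(1 + r) - 1
print(cum_ret)  # roughly 0.100 and 0.045, since 1.10 * 0.95 = 1.045
```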

@earch's suggestion really helps in seeing how all these combinations are calculated. I was wondering whether there is a more efficient solution using data.table's functionality. My real data set is much larger than this (millions of rows), and enumerating the combinations will start to take a toll.

EDIT #2::: After being able to step through the process, I have figured out a very fast solution!

tmp <- dt[,list(par1=par1[which.max(perf)],par2=par2[which.max(perf)],perf=max(perf)),by=list(ticker,row_names)]
res <- tmp[,list(perf=mean(perf),par1=paste(par1,collapse=","),par2=paste(par2,collapse=",")),by=row_names]

data.table lets me take the max perf for each ticker and row_names combination. Because the mean across tickers is largest when each ticker's own perf is largest, I can then group by row_names and average those maxima. And it gets the same results!

> res
   row_names        perf par1 par2
1:         M 0.010413549  2,2  2,1
2:         N 0.009508122  2,2  1,1
3:         O 0.009314068  1,1  2,1
4:         P 0.008883106  2,1  2,2
5:         Q 0.009316006  2,2  2,2

road_to_quantdom · Accepted Answer

EDIT #2::: After being able to step through the process, I have figured out a very fast solution!

tmp <- dt[,list(par1=par1[which.max(perf)],
                par2=par2[which.max(perf)],
                perf=max(perf)),
          by=list(ticker,row_names)]
res <- tmp[,list(perf=mean(perf),
                 par1=paste(par1,collapse=","),
                 par2=paste(par2,collapse=",")),
           by=row_names]

data.table lets me take the max perf for each ticker and row_names combination. Because the mean across tickers is largest when each ticker's own perf is largest, I can then group by row_names and average those maxima. And it gets the same results!

> res
   row_names        perf par1 par2
1:         M 0.010413549  2,2  2,1
2:         N 0.009508122  2,2  1,1
3:         O 0.009314068  1,1  2,1
4:         P 0.008883106  2,1  2,2
5:         Q 0.009316006  2,2  2,2
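A quick way to check that the shortcut agrees with the exhaustive search is to compute both on the example data and compare them. The following is a minimal sketch; the names fast, brute, a and b are hypothetical. It relies on the objective being a plain mean with each ticker's parameters chosen independently.

```r
library(data.table)

# Rebuild the example data from the question
set.seed(1)
dt <- data.table(expand.grid(c("a","b"), 1:2, 1:2, c("M","N","O","P","Q")))
dt[, perf := rnorm(.N, 0, .01)]
setnames(dt, c("ticker","par1","par2","row_names","perf"))

# Fast route: best perf per (ticker, row_names), then average across tickers
fast <- dt[, .(perf = max(perf)), by = .(ticker, row_names)
           ][, .(best = mean(perf)), by = row_names]

# Brute force: every "a" row paired with every "b" row within a row_names
a <- dt[ticker == "a", .(row_names, a_perf = perf)]
b <- dt[ticker == "b", .(row_names, b_perf = perf)]
brute <- merge(a, b, by = "row_names", allow.cartesian = TRUE
               )[, .(best = max((a_perf + b_perf) / 2)), by = row_names]

# Put both results in the same row_names order before comparing
setkey(fast, row_names)
setkey(brute, row_names)
print(all.equal(fast$best, brute$best))
```

The equivalence would not be expected to hold if both tickers had to share the same par1/par2, or if the objective did not split into one independent term per ticker. In those cases the pairs would need to be enumerated or the problem restructured.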
