Passing a function with parameters as columns within double loop in R

Question

I have a problem in applying a function (some formula) which takes two parameters (two columns) and iterates through them within the data frame. So I have 20 variables and I want the function to recalculate all combinations of two variables and save the result in a variable (so in a way, the script should create 3 x 3 = 9 new variables). So I am thinking to use a double loop.

example:

dflist1 <- c(dat$DIM1, dat$DIM2, dat$DIM3)
k = 1
    for (i in dflist1) {
        for (j in dflist1) {

However, then I wish to apply function, which contains formula (and takes two parameters: one as one column and second as another column).

Example of function:

calc <- function(i,j)
{
    abs(i*j)/(i*j)*log(1+abs(i*j))
}

calc(i = dat$DIM1, j = dat$DIM2)

so back to for loop's - here comes the problem, when I try to apply the function and save it in another column, it only saves last calculated result (I haven't yet set the iteration of newly created variables):

for (i in dflist1) {

    for (j in dflist1) {

    dat$kk <- mapply(FUN = calc, i, j, SIMPLIFY = TRUE)
    print(dat$kk)
    }

k = k + 1

}

Can someone please help me with that? So I need to iterate calculation (by row) through all combos of columns and write results in new columns.

Uwe · Accepted Answer

The data.table solution uses the CJ() cross join function to create the 3 x 3 pairs of input parameters to the calc() function, dcast() to reshape the computed results from long to wide format, and a join to append the 9 computed columns to the original data.frame:

library(data.table)
setDT(df)[, rn := .I]
df[df[, CJ(c(a, b, c), c(a, b, c)), by = rn][
  , value := calc(V1, V2)][
    , dcast(.SD, rn ~ rowid(rn))], on = .(rn)][, !"rn"]

                a           b          c           1            2           3            4
   1: -0.56047565 -0.99579872 -0.5116037 0.688945918  0.443480566  0.41174822  0.443480566
   2: -0.23017749 -1.03995504  0.2369379 0.733091907  0.214606608 -0.22026320  0.214606608
   3:  1.55870831 -0.01798024 -0.5415892 0.257211652  0.009690796 -0.61203449  0.009690796
   4:  0.07050839 -0.13217513  1.2192276 0.017319415 -0.009276298 -0.14941225 -0.009276298
   5:  0.12928774 -2.54934277  0.1741359 2.014789492 -0.284877208 -0.36736997 -0.284877208
  ---                                                                                     
 996: -0.08997520  0.07664366  1.0609662 0.008062943 -0.006872360 -0.09117496 -0.006872360
 997:  1.07051604  0.25516476 -0.4455056 0.181050147 -0.107667460 -0.38995947 -0.107667460
 998: -1.35110039  0.27744682 -0.4291802 1.038675520  0.457339699 -0.31835082  0.457339699
 999: -0.52261670  0.53685602  1.1890118 0.241477031 -0.247305230 -0.48328838 -0.247305230
1000: -0.24919068 -0.46048557  0.8342941 0.192310630  0.108629008 -0.32510818  0.108629008
                 5           6           7           8          9
   1: 0.2731770948  0.25211300  0.41174822  0.25211300 0.23249043
   2: 0.0516258319 -0.05310253 -0.22026320 -0.05310253 0.05462033
   3: 0.0003232368 -0.02764041 -0.61203449 -0.02764041 1.23243536
   4: 0.0049591165  0.08246971 -0.14941225  0.08246971 0.91088256
   5: 0.0165771550  0.02226394 -0.36736997  0.02226394 0.02987264
  ---                                                            
 996: 0.0058570650  0.07817913 -0.09117496  0.07817913 0.75407734
 997: 0.0630771939  0.24150040 -0.38995947  0.24150040 0.76360778
 998: 0.1690637292 -0.11250215 -0.31835082 -0.11250215 0.07415780
 999: 0.2532570646  0.49367630 -0.48328838  0.49367630 0.88118117
1000: 0.0602443085 -0.18888191 -0.32510818 -0.18888191 0.52830001

Note that the computed values may appear in different columns when comparing with LAP's result.

Edit: Improved version for commutative function

Apparently, OP's function definition is commutative, i.e., calc(1, 2) returns the same value as calc(2, 1). This is why we find only 6 different computed values in each row.

In case of a commutative function we can save the computation of 3 duplicate values. So instead of doing a full cross join

CJ(1:3, 1:3)

we can use only unique combinations

CJ(1:3, 1:3)[V1 <= V2]

Note that this is just an illustrative example to explain the effect and can not be used stand-alone.

We need to modify the complete data.table expression:

df[df[, CJ(c(a, b, c), c(a, b, c), sorted = FALSE)[V1 <= V2], by = rn][
  , value := calc(V1, V2)][
    , dcast(.SD, rn ~ rowid(rn))], on = .(rn)][, !"rn"]

                a           b          c           1            2           3
   1: -0.56047565 -0.99579872 -0.5116037 0.688945918  0.443480566  0.41174822
   2: -0.23017749 -1.03995504  0.2369379 0.733091907  0.214606608 -0.22026320
   3:  1.55870831 -0.01798024 -0.5415892 0.257211652  0.009690796 -0.61203449
   4:  0.07050839 -0.13217513  1.2192276 0.017319415 -0.009276298 -0.14941225
   5:  0.12928774 -2.54934277  0.1741359 2.014789492 -0.284877208 -0.36736997
  ---                                                                        
 996: -0.08997520  0.07664366  1.0609662 0.008062943 -0.006872360 -0.09117496
 997:  1.07051604  0.25516476 -0.4455056 0.181050147 -0.107667460 -0.38995947
 998: -1.35110039  0.27744682 -0.4291802 1.038675520  0.457339699 -0.31835082
 999: -0.52261670  0.53685602  1.1890118 0.241477031 -0.247305230 -0.48328838
1000: -0.24919068 -0.46048557  0.8342941 0.192310630  0.108629008 -0.32510818
                 4           5          6
   1: 0.2731770948  0.25211300 0.23249043
   2: 0.0516258319 -0.05310253 0.05462033
   3: 0.0003232368 -0.02764041 1.23243536
   4: 0.0049591165  0.08246971 0.91088256
   5: 0.0165771550  0.02226394 0.02987264
  ---                                    
 996: 0.0058570650  0.07817913 0.75407734
 997: 0.0630771939  0.24150040 0.76360778
 998: 0.1690637292 -0.11250215 0.07415780
 999: 0.2532570646  0.49367630 0.88118117
1000: 0.0602443085 -0.18888191 0.52830001

Note that sorted = FALSE is required to keep the order of values as supplied to CJ(). CJ() will sort the values by default.

Edit 2: Saving to type column names

In case there are many more column names, it might be cumbersome to type all the column names twice for the cross join.

This can solved by the following modification:

df[df[, CJ(do.call("c", .SD), do.call("c", .SD), sorted = FALSE), by = rn][u
  , value := calc(V1, V2)][
    , dcast(.SD, rn ~ rowid(rn))], on = .(rn)][, !"rn"]

or

df[df[, CJ(do.call("c", .SD), do.call("c", .SD), sorted = FALSE)[V1 <= V2], by = rn][
  , value := calc(V1, V2)][
    , dcast(.SD, rn ~ rowid(rn))], on = .(rn)][, !"rn"]

for the commutative case.

do.call() constructs and executes a function call from a name or a function and a list of arguments to be passed to it. (see help(do.call)). .SD is a special symbol which denotes the subset of data for each group, excluding any columns used in by. As we are grouping by each row, here, .SD is a list with one value of each column which is passed to the c() function.

By referring to .SD, all columns of df are used in the cross join, except those in the by parameter. However, we can specify the columns to be included in the cross join with the .SDcols parameter, e.g.

df[df[, CJ(do.call("c", .SD), do.call("c", .SD), sorted = FALSE)[V1 <= V2], by = rn, 
      .SDcols = 1:2][
        , value := calc(V1, V2)][
          , dcast(.SD, rn ~ rowid(rn))], on = .(rn)][, !"rn"]

will only use the first two columns of df

                a           b          c            1            2           3
   1: -0.56047565 -0.99579872 -0.5116037 0.6889459178  0.443480566 0.273177095
   2: -0.23017749 -1.03995504  0.2369379 0.7330919071  0.214606608 0.051625832
   3:  1.55870831 -0.01798024 -0.5415892 0.0003232368 -0.027640410 1.232435358
   4:  0.07050839 -0.13217513  1.2192276 0.0173194151 -0.009276298 0.004959116
   5:  0.12928774 -2.54934277  0.1741359 2.0147894919 -0.284877208 0.016577155
  ---                                                                         
 996: -0.08997520  0.07664366  1.0609662 0.0080629430 -0.006872360 0.005857065
 997:  1.07051604  0.25516476 -0.4455056 0.0630771939  0.241500405 0.763607781
 998: -1.35110039  0.27744682 -0.4291802 1.0386755196 -0.318350818 0.074157797
 999: -0.52261670  0.53685602  1.1890118 0.2414770311 -0.247305230 0.253257065
1000: -0.24919068 -0.46048557  0.8342941 0.1923106299  0.108629008 0.060244308

Edit 3: Preserving the order of combinations when dropping duplicates

The parameter sorted = FALSE is required but not sufficient to keep the created combinations always in the same order according to the column numbering. This is caused by [V1 <= V2] comparing values but not positions of columns.

So we have to make sure that always the same rows are removed from the table of combinations. Here is a small example:

test <- data.table(rn = 1:3, 
                   a = LETTERS[c(3L, 1:2)],
                   b = LETTERS[c(2:3, 1L)], 
                   c = LETTERS[1:3])
test

   rn a b c
1:  1 C B A
2:  2 A C B
3:  3 B A C

# dropping duplicates by value
test[, CJ(c(a, b, c), c(a, b, c), sorted = FALSE)[,cn := .I][V1 <= V2], by = rn]

    rn V1 V2 cn
 1:  1  C  C  1
 2:  1  B  C  4
 3:  1  B  B  5
 4:  1  A  C  7
 5:  1  A  B  8
 6:  1  A  A  9
 7:  2  A  A  1
 8:  2  A  C  2
 9:  2  A  B  3
10:  2  C  C  5
11:  2  B  C  8
12:  2  B  B  9
13:  3  B  B  1
14:  3  B  C  3
15:  3  A  B  4
16:  3  A  A  5
17:  3  A  C  6
18:  3  C  C  9

# dropping duplicates by position
drop <- CJ(1:3, 1:3)[V1 > V2, which = TRUE]
test[, CJ(c(a, b, c), c(a, b, c), sorted = FALSE)[,cn := .I][-drop], by = rn]

    rn V1 V2 cn
 1:  1  C  C  1
 2:  1  C  B  2
 3:  1  C  A  3
 4:  1  B  B  5
 5:  1  B  A  6
 6:  1  A  A  9
 7:  2  A  A  1
 8:  2  A  C  2
 9:  2  A  B  3
10:  2  C  C  5
11:  2  C  B  6
12:  2  B  B  9
13:  3  B  B  1
14:  3  B  A  2
15:  3  B  C  3
16:  3  A  A  5
17:  3  A  C  6
18:  3  C  C  9

For illustration, the created combinations have been numbered consecutively before being filtered. Filtering by position keeps the same combinations cn for each input row rn.

If all columns of df are to be used for creating combinations while keeping positions the code becomes finally

drop <- CJ(seq_along(df), seq_along(df))[V1 > V2, which = TRUE]
setDT(df)[, rn := .I] # execution order is important, drop needs to be computed first
df[df[, CJ(do.call("c", .SD), do.call("c", .SD), sorted = FALSE)[-drop], by = rn][
        , value := calc(V1, V2)][
          , dcast(.SD, rn ~ rowid(rn))], on = .(rn)][, !"rn"]

                a           b          c           1            2           3            4
   1: -0.56047565 -0.99579872 -0.5116037 0.273177095  0.443480566  0.25211300 0.6889459178
   2: -0.23017749 -1.03995504  0.2369379 0.051625832  0.214606608 -0.05310253 0.7330919071
   3:  1.55870831 -0.01798024 -0.5415892 1.232435358 -0.027640410 -0.61203449 0.0003232368
   4:  0.07050839 -0.13217513  1.2192276 0.004959116 -0.009276298  0.08246971 0.0173194151
   5:  0.12928774 -2.54934277  0.1741359 0.016577155 -0.284877208  0.02226394 2.0147894919
  ---                                                                                     
 996: -0.08997520  0.07664366  1.0609662 0.008062943 -0.006872360 -0.09117496 0.0058570650
 997:  1.07051604  0.25516476 -0.4455056 0.763607781  0.241500405 -0.38995947 0.0630771939
 998: -1.35110039  0.27744682 -0.4291802 1.038675520 -0.318350818  0.45733970 0.0741577975
 999: -0.52261670  0.53685602  1.1890118 0.241477031 -0.247305230 -0.48328838 0.2532570646
1000: -0.24919068 -0.46048557  0.8342941 0.060244308  0.108629008 -0.18888191 0.1923106299
                 5          6
   1:  0.411748217 0.23249043
   2: -0.220263201 0.05462033
   3:  0.009690796 0.25721165
   4: -0.149412251 0.91088256
   5: -0.367369972 0.02987264
  ---                        
 996:  0.078179132 0.75407734
 997: -0.107667460 0.18105015
 998: -0.112502154 0.16906373
 999:  0.493676298 0.88118117
1000: -0.325108180 0.52830001

Passing a function with parameters as columns within double loop in R

Answers (2)

Edit: Improved version for commutative function

Edit 2: Saving to type column names

Edit 3: Preserving the order of combinations when dropping duplicates

Related Questions