Reputation: 77
I have a problem in applying a function (some formula) which takes two parameters (two columns) and iterates through them within the data frame. So I have 20 variables and I want the function to recalculate all combinations of two variables and save the result in a variable (so in a way, the script should create 3 x 3 = 9 new variables). So I am thinking to use a double loop.
example:
dflist1 <- c(dat$DIM1, dat$DIM2, dat$DIM3)
k = 1
for (i in dflist1) {
for (j in dflist1) {
However, then I wish to apply function, which contains formula (and takes two parameters: one as one column and second as another column).
Example of function:
calc <- function(i,j)
{
abs(i*j)/(i*j)*log(1+abs(i*j))
}
calc(i = dat$DIM1, j = dat$DIM2)
so back to for loop's - here comes the problem, when I try to apply the function and save it in another column, it only saves last calculated result (I haven't yet set the iteration of newly created variables):
for (i in dflist1) {
for (j in dflist1) {
dat$kk <- mapply(FUN = calc, i, j, SIMPLIFY = TRUE)
print(dat$kk)
}
k = k + 1
}
Can someone please help me with that? So I need to iterate calculation (by row) through all combos of columns and write results in new columns.
Upvotes: 2
Views: 1262
Reputation: 42544
The data.table
solution uses the CJ()
cross join function to create the 3 x 3 pairs of input parameters to the calc()
function, dcast()
to reshape the computed results from long to wide format, and a join to append the 9 computed columns to the original data.frame:
library(data.table)
setDT(df)[, rn := .I]
df[df[, CJ(c(a, b, c), c(a, b, c)), by = rn][
, value := calc(V1, V2)][
, dcast(.SD, rn ~ rowid(rn))], on = .(rn)][, !"rn"]
a b c 1 2 3 4 1: -0.56047565 -0.99579872 -0.5116037 0.688945918 0.443480566 0.41174822 0.443480566 2: -0.23017749 -1.03995504 0.2369379 0.733091907 0.214606608 -0.22026320 0.214606608 3: 1.55870831 -0.01798024 -0.5415892 0.257211652 0.009690796 -0.61203449 0.009690796 4: 0.07050839 -0.13217513 1.2192276 0.017319415 -0.009276298 -0.14941225 -0.009276298 5: 0.12928774 -2.54934277 0.1741359 2.014789492 -0.284877208 -0.36736997 -0.284877208 --- 996: -0.08997520 0.07664366 1.0609662 0.008062943 -0.006872360 -0.09117496 -0.006872360 997: 1.07051604 0.25516476 -0.4455056 0.181050147 -0.107667460 -0.38995947 -0.107667460 998: -1.35110039 0.27744682 -0.4291802 1.038675520 0.457339699 -0.31835082 0.457339699 999: -0.52261670 0.53685602 1.1890118 0.241477031 -0.247305230 -0.48328838 -0.247305230 1000: -0.24919068 -0.46048557 0.8342941 0.192310630 0.108629008 -0.32510818 0.108629008 5 6 7 8 9 1: 0.2731770948 0.25211300 0.41174822 0.25211300 0.23249043 2: 0.0516258319 -0.05310253 -0.22026320 -0.05310253 0.05462033 3: 0.0003232368 -0.02764041 -0.61203449 -0.02764041 1.23243536 4: 0.0049591165 0.08246971 -0.14941225 0.08246971 0.91088256 5: 0.0165771550 0.02226394 -0.36736997 0.02226394 0.02987264 --- 996: 0.0058570650 0.07817913 -0.09117496 0.07817913 0.75407734 997: 0.0630771939 0.24150040 -0.38995947 0.24150040 0.76360778 998: 0.1690637292 -0.11250215 -0.31835082 -0.11250215 0.07415780 999: 0.2532570646 0.49367630 -0.48328838 0.49367630 0.88118117 1000: 0.0602443085 -0.18888191 -0.32510818 -0.18888191 0.52830001
Note that the computed values may appear in different columns when comparing with LAP's result.
Apparently, OP's function definition is commutative, i.e., calc(1, 2)
returns the same value as calc(2, 1)
. This is why we find only 6 different computed values in each row.
In case of a commutative function we can save the computation of 3 duplicate values. So instead of doing a full cross join
CJ(1:3, 1:3)
V1 V2 1: 1 1 2: 1 2 3: 1 3 4: 2 1 5: 2 2 6: 2 3 7: 3 1 8: 3 2 9: 3 3
we can use only unique combinations
CJ(1:3, 1:3)[V1 <= V2]
V1 V2 1: 1 1 2: 1 2 3: 1 3 4: 2 2 5: 2 3 6: 3 3
Note that this is just an illustrative example to explain the effect and can not be used stand-alone.
We need to modify the complete data.table
expression:
df[df[, CJ(c(a, b, c), c(a, b, c), sorted = FALSE)[V1 <= V2], by = rn][
, value := calc(V1, V2)][
, dcast(.SD, rn ~ rowid(rn))], on = .(rn)][, !"rn"]
a b c 1 2 3 1: -0.56047565 -0.99579872 -0.5116037 0.688945918 0.443480566 0.41174822 2: -0.23017749 -1.03995504 0.2369379 0.733091907 0.214606608 -0.22026320 3: 1.55870831 -0.01798024 -0.5415892 0.257211652 0.009690796 -0.61203449 4: 0.07050839 -0.13217513 1.2192276 0.017319415 -0.009276298 -0.14941225 5: 0.12928774 -2.54934277 0.1741359 2.014789492 -0.284877208 -0.36736997 --- 996: -0.08997520 0.07664366 1.0609662 0.008062943 -0.006872360 -0.09117496 997: 1.07051604 0.25516476 -0.4455056 0.181050147 -0.107667460 -0.38995947 998: -1.35110039 0.27744682 -0.4291802 1.038675520 0.457339699 -0.31835082 999: -0.52261670 0.53685602 1.1890118 0.241477031 -0.247305230 -0.48328838 1000: -0.24919068 -0.46048557 0.8342941 0.192310630 0.108629008 -0.32510818 4 5 6 1: 0.2731770948 0.25211300 0.23249043 2: 0.0516258319 -0.05310253 0.05462033 3: 0.0003232368 -0.02764041 1.23243536 4: 0.0049591165 0.08246971 0.91088256 5: 0.0165771550 0.02226394 0.02987264 --- 996: 0.0058570650 0.07817913 0.75407734 997: 0.0630771939 0.24150040 0.76360778 998: 0.1690637292 -0.11250215 0.07415780 999: 0.2532570646 0.49367630 0.88118117 1000: 0.0602443085 -0.18888191 0.52830001
Note that sorted = FALSE
is required to keep the order of values as supplied to CJ()
. CJ()
will sort the values by default.
In case there are many more column names, it might be cumbersome to type all the column names twice for the cross join.
This can solved by the following modification:
df[df[, CJ(do.call("c", .SD), do.call("c", .SD), sorted = FALSE), by = rn][u
, value := calc(V1, V2)][
, dcast(.SD, rn ~ rowid(rn))], on = .(rn)][, !"rn"]
or
df[df[, CJ(do.call("c", .SD), do.call("c", .SD), sorted = FALSE)[V1 <= V2], by = rn][
, value := calc(V1, V2)][
, dcast(.SD, rn ~ rowid(rn))], on = .(rn)][, !"rn"]
for the commutative case.
do.call()
constructs and executes a function call from a name or a function and a list of arguments to be passed to it. (see help(do.call)
). .SD
is a special symbol which denotes the subset of data for each group, excluding any columns used in by
. As we are grouping by each row, here, .SD
is a list with one value of each column which is passed to the c()
function.
By referring to .SD
, all columns of df
are used in the cross join, except those in the by
parameter. However, we can specify the columns to be included in the cross join with the .SDcols
parameter, e.g.
df[df[, CJ(do.call("c", .SD), do.call("c", .SD), sorted = FALSE)[V1 <= V2], by = rn,
.SDcols = 1:2][
, value := calc(V1, V2)][
, dcast(.SD, rn ~ rowid(rn))], on = .(rn)][, !"rn"]
will only use the first two columns of df
a b c 1 2 3 1: -0.56047565 -0.99579872 -0.5116037 0.6889459178 0.443480566 0.273177095 2: -0.23017749 -1.03995504 0.2369379 0.7330919071 0.214606608 0.051625832 3: 1.55870831 -0.01798024 -0.5415892 0.0003232368 -0.027640410 1.232435358 4: 0.07050839 -0.13217513 1.2192276 0.0173194151 -0.009276298 0.004959116 5: 0.12928774 -2.54934277 0.1741359 2.0147894919 -0.284877208 0.016577155 --- 996: -0.08997520 0.07664366 1.0609662 0.0080629430 -0.006872360 0.005857065 997: 1.07051604 0.25516476 -0.4455056 0.0630771939 0.241500405 0.763607781 998: -1.35110039 0.27744682 -0.4291802 1.0386755196 -0.318350818 0.074157797 999: -0.52261670 0.53685602 1.1890118 0.2414770311 -0.247305230 0.253257065 1000: -0.24919068 -0.46048557 0.8342941 0.1923106299 0.108629008 0.060244308
The parameter sorted = FALSE
is required but not sufficient to keep the created combinations always in the same order according to the column numbering. This is caused by [V1 <= V2]
comparing values but not positions of columns.
So we have to make sure that always the same rows are removed from the table of combinations. Here is a small example:
test <- data.table(rn = 1:3,
a = LETTERS[c(3L, 1:2)],
b = LETTERS[c(2:3, 1L)],
c = LETTERS[1:3])
test
rn a b c 1: 1 C B A 2: 2 A C B 3: 3 B A C
# dropping duplicates by value
test[, CJ(c(a, b, c), c(a, b, c), sorted = FALSE)[,cn := .I][V1 <= V2], by = rn]
rn V1 V2 cn 1: 1 C C 1 2: 1 B C 4 3: 1 B B 5 4: 1 A C 7 5: 1 A B 8 6: 1 A A 9 7: 2 A A 1 8: 2 A C 2 9: 2 A B 3 10: 2 C C 5 11: 2 B C 8 12: 2 B B 9 13: 3 B B 1 14: 3 B C 3 15: 3 A B 4 16: 3 A A 5 17: 3 A C 6 18: 3 C C 9
# dropping duplicates by position
drop <- CJ(1:3, 1:3)[V1 > V2, which = TRUE]
test[, CJ(c(a, b, c), c(a, b, c), sorted = FALSE)[,cn := .I][-drop], by = rn]
rn V1 V2 cn 1: 1 C C 1 2: 1 C B 2 3: 1 C A 3 4: 1 B B 5 5: 1 B A 6 6: 1 A A 9 7: 2 A A 1 8: 2 A C 2 9: 2 A B 3 10: 2 C C 5 11: 2 C B 6 12: 2 B B 9 13: 3 B B 1 14: 3 B A 2 15: 3 B C 3 16: 3 A A 5 17: 3 A C 6 18: 3 C C 9
For illustration, the created combinations have been numbered consecutively before being filtered. Filtering by position keeps the same combinations cn
for each input row rn
.
If all columns of df
are to be used for creating combinations while keeping positions the code becomes finally
drop <- CJ(seq_along(df), seq_along(df))[V1 > V2, which = TRUE]
setDT(df)[, rn := .I] # execution order is important, drop needs to be computed first
df[df[, CJ(do.call("c", .SD), do.call("c", .SD), sorted = FALSE)[-drop], by = rn][
, value := calc(V1, V2)][
, dcast(.SD, rn ~ rowid(rn))], on = .(rn)][, !"rn"]
a b c 1 2 3 4 1: -0.56047565 -0.99579872 -0.5116037 0.273177095 0.443480566 0.25211300 0.6889459178 2: -0.23017749 -1.03995504 0.2369379 0.051625832 0.214606608 -0.05310253 0.7330919071 3: 1.55870831 -0.01798024 -0.5415892 1.232435358 -0.027640410 -0.61203449 0.0003232368 4: 0.07050839 -0.13217513 1.2192276 0.004959116 -0.009276298 0.08246971 0.0173194151 5: 0.12928774 -2.54934277 0.1741359 0.016577155 -0.284877208 0.02226394 2.0147894919 --- 996: -0.08997520 0.07664366 1.0609662 0.008062943 -0.006872360 -0.09117496 0.0058570650 997: 1.07051604 0.25516476 -0.4455056 0.763607781 0.241500405 -0.38995947 0.0630771939 998: -1.35110039 0.27744682 -0.4291802 1.038675520 -0.318350818 0.45733970 0.0741577975 999: -0.52261670 0.53685602 1.1890118 0.241477031 -0.247305230 -0.48328838 0.2532570646 1000: -0.24919068 -0.46048557 0.8342941 0.060244308 0.108629008 -0.18888191 0.1923106299 5 6 1: 0.411748217 0.23249043 2: -0.220263201 0.05462033 3: 0.009690796 0.25721165 4: -0.149412251 0.91088256 5: -0.367369972 0.02987264 --- 996: 0.078179132 0.75407734 997: -0.107667460 0.18105015 998: -0.112502154 0.16906373 999: 0.493676298 0.88118117 1000: -0.325108180 0.52830001
Upvotes: 3
Reputation: 6685
You could use a nested lapply
to create the new columns, then use a nested do.call
to bind them to your data:
set.seed(123)
df <- data.frame(a = rnorm(1000), b = rnorm(1000), c = rnorm(1000))
calc <- function(i,j)
{
abs(i*j)/(i*j)*log(1+abs(i*j))
}
newcols <- lapply(df, function(x) lapply(df, function(y) calc(x, y)))
df_new <- cbind(df, do.call(cbind, do.call(cbind, newcols)))
> head(df_new)
a b c 1 2 3 4 5 6 7
1 -0.56047565 -0.99579872 -0.5116037 0.273177095 0.443480566 0.25211300 0.443480566 0.6889459178 0.411748217 0.25211300
2 -0.23017749 -1.03995504 0.2369379 0.051625832 0.214606608 -0.05310253 0.214606608 0.7330919071 -0.220263201 -0.05310253
3 1.55870831 -0.01798024 -0.5415892 1.232435358 -0.027640410 -0.61203449 -0.027640410 0.0003232368 0.009690796 -0.61203449
4 0.07050839 -0.13217513 1.2192276 0.004959116 -0.009276298 0.08246971 -0.009276298 0.0173194151 -0.149412251 0.08246971
5 0.12928774 -2.54934277 0.1741359 0.016577155 -0.284877208 0.02226394 -0.284877208 2.0147894919 -0.367369972 0.02226394
6 1.71506499 1.04057346 -0.6152683 1.371548145 1.024122587 -0.72038540 1.024122587 0.7337098375 -0.494837621 -0.72038540
8 9
1 0.411748217 0.23249043
2 -0.220263201 0.05462033
3 0.009690796 0.25721165
4 -0.149412251 0.91088256
5 -0.367369972 0.02987264
6 -0.494837621 0.32103592
Upvotes: 2