Reputation: 157
I have two columns of values, a and b. I want to add a third column c whose value at row i is the cumulative sum of b up to row i plus the cumulative sum of c up to row i-1, multiplied by a at row i, i.e.
c_i = (sum_i (b) + sum_(i-1) (c) ) * a_i
I tried
data %>%
mutate(
c = a * (cumsum(b) + lag(cumsum(c), default = 0))
)
However this doesn't work, as I am just creating c based on values of c that don't exist at the moment:
Error: Problem with `mutate()` input `c`. x object 'c' not found
Previously I handled such problems with for-loops. However, I have gotten used to dplyr, and there is usually a way to do it there; I just cannot find it in this case.
I am grateful for any help!
Edit: an earlier version of this question was inaccurate, since a is also a vector, not a constant. I have corrected the formula accordingly.
The desired output:
row 1: 0.5 * (7 + 0) = 3.5
row 2: 0.3 * (7 + 1 + 3.5) = 3.45
row 3: 1.0 * (7 + 1 + 9 + 3.5 + 3.45) = 23.95
| a | b | c |
|---|---|---|
|0.5|7|3.5|
|0.3|1|3.45|
|1|9|23.95|
|0.2|10|...|
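For reference, the recurrence can be written as a plain base R loop that keeps two running totals. This is only a minimal sketch of the for-loop approach mentioned above; `running_c` is a hypothetical helper name, not something from the original attempt.

```r
# Hypothetical helper: computes c_i = a_i * (sum_{j<=i} b_j + sum_{j<i} c_j)
running_c <- function(a, b) {
  n <- length(a)
  out <- numeric(n)
  sum_b <- 0  # running sum of b up to and including row i
  sum_c <- 0  # running sum of c up to row i - 1
  for (i in seq_len(n)) {
    sum_b <- sum_b + b[i]
    out[i] <- a[i] * (sum_b + sum_c)
    sum_c <- sum_c + out[i]
  }
  out
}

running_c(c(0.5, 0.3, 1, 0.2), c(7, 1, 9, 10))
#> [1]  3.50  3.45 23.95 11.58
```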
Upvotes: 7
Views: 941
Reputation: 16978
It's a little bit off-topic, but I benchmarked all the solutions shown here. I took the solutions as shown in the answers; the only change was renaming 27 ϕ 9's function to g.
library(microbenchmark)
library(Rcpp)
library(dplyr)
library(purrr)
microbenchmark(
f_TIC_1 = transform(
df,
c = solve(
`diag<-`(mat <- matrix(-a, length(a), length(a)), 1) * lower.tri(mat, diag = TRUE),
a * cumsum(b)
)
),
f_TIC_2 = df %>%
mutate(
c = solve(
`diag<-`(mat <- matrix(-a, length(a), length(a)), 1) * lower.tri(mat, diag = TRUE),
a * cumsum(b)
)
),
f_TIC_3 = df %>% mutate(c = f(n())),
f_27p9 = df %>% mutate(c = g(a, b)),
f_AnG_1 = transform(df, C = {x <- 0; Reduce(function(.x, .y){x <<- .x + x; (cumsum(b)[[.y]] + x) * a[[.y]]},
seq(nrow(df)),
init = 0,
accumulate = TRUE)[-1]}),
f_AnG_2 = df %>%
mutate(C = {x <- 0; Reduce(function(.x, .y){x <<- .x + x; (cumsum(b)[[.y]] + x) * a[[.y]]},
seq(nrow(df)),
init = 0,
accumulate = TRUE)[-1]}),
f_AnG_3 = df %>%
mutate(C = {x <- 0; unlist(accumulate2(cumsum(b), a, .init = 0, ~ {x <<- ..1 + x; (..2 + x) * ..3 }))[-1]}),
f_JCA = df$c <- custom_cumsum(df$a, df$b),
times = 10000
)
This gives us a clear winner:
Unit: microseconds
expr min lq mean median uq max neval
f_TIC_1 169.8 194.2 229.09701 204.6 216.8 68505.6 10000
f_TIC_2 1314.2 1372.3 1500.52929 1384.1 1415.0 9688.8 10000
f_TIC_3 1360.6 1429.7 1557.12813 1442.8 1476.0 8486.1 10000
f_27p9 1351.8 1402.4 1553.47802 1415.4 1446.9 68681.4 10000
f_AnG_1 143.2 166.8 187.83451 174.5 184.6 4446.3 10000
f_AnG_2 1337.8 1395.4 1532.25551 1407.1 1439.5 64981.6 10000
f_AnG_3 1371.0 1428.1 1550.83456 1441.3 1474.0 10439.1 10000
f_JCA 48.0 71.6 79.39209 76.3 82.3 5544.4 10000
@Jean-Claude Arbaut's custom C function ranks first in performance, followed by @AnilGoyal's transform-Reduce solution, with @ThomasIsCoding's transform-solve-matrix approach in third place.
The tidyverse solutions are quite inefficient: they take about the same time as @ThomasIsCoding's "Recursion approach, INFFICIENT!!!!111".
Thanks to 27 ϕ 9's comment, I made another example using a data.frame with 1000 rows:
set.seed(2^13)
n <- 1000
df <- data.frame(a = runif(n),
b = sample(1:n, n, replace = TRUE))
The first interesting insight: the transform-solve-matrix solutions refuse to work, returning an error:
Error in solve.default(`diag<-`(mat <- matrix(-a, length(a), length(a)), : system is computationally singular: reciprocal condition number
So I removed those solutions from the benchmark:
microbenchmark::microbenchmark(
# f_TIC_1 = transform(
# df,
# c = solve(
# `diag<-`(mat <- matrix(-a, length(a), length(a)), 1) * lower.tri(mat, diag = TRUE),
# a * cumsum(b)
# )
# ),
# f_TIC_2 = df %>%
# mutate(
# c = solve(
# `diag<-`(mat <- matrix(-a, length(a), length(a)), 1) * lower.tri(mat, diag = TRUE),
# a * cumsum(b)
# )
# ),
f_TIC_3 = df %>% mutate(c = f(n())),
f_27p9 = df %>% mutate(c = g(a, b)),
f_AnG_1 = transform(df, C = {x <- 0; Reduce(function(.x, .y){x <<- .x + x; (cumsum(b)[[.y]] + x) * a[[.y]]},
seq(nrow(df)),
init = 0,
accumulate = TRUE)[-1]}),
f_AnG_2 = df %>%
mutate(C = {x <- 0; Reduce(function(.x, .y){x <<- .x + x; (cumsum(b)[[.y]] + x) * a[[.y]]},
seq(nrow(df)),
init = 0,
accumulate = TRUE)[-1]}),
f_AnG_3 = df %>%
mutate(C = {x <- 0; unlist(accumulate2(cumsum(b), a, .init = 0, ~ {x <<- ..1 + x; (..2 + x) * ..3 }))[-1]}),
f_JCA = df$c <- custom_cumsum(df$a, df$b),
times = 10000
)
This returns
Unit: microseconds
expr min lq mean median uq max neval
f_TIC_3 8880.6 9264.65 10369.56781 9401.40 9632.55 93771.1 10000
f_27p9 3765.4 4050.80 4549.94921 4140.95 4259.25 80884.8 10000
f_AnG_1 3494.1 3670.70 4050.70616 3724.50 3812.10 83652.8 10000
f_AnG_2 5296.8 5603.70 6128.01699 5688.20 5803.85 82966.1 10000
f_AnG_3 3966.5 4249.30 4732.57000 4340.15 4459.40 85970.4 10000
f_JCA 11.3 23.00 25.84787 26.20 27.20 6148.9 10000
The next benchmark I did was using
set.seed(2^14)
n <- 10000
df <- data.frame(a = runif(n),
b = sample(1:10, n, replace = TRUE))
This time f_TIC_3 refused to run, yielding
Error: C stack usage 15923184 is too close to the limit
Using a recursive function on large datasets doesn't seem to be an effective approach. I also reduced the times argument from 10000 to 1000.
Unit: microseconds
expr min lq mean median uq max neval
f_27p9 29822.6 33721.1 38248.6520 37433.10 40815.25 120926.6 1000
f_AnG_1 184626.9 198480.4 212004.0363 213608.10 218590.20 301492.9 1000
f_AnG_2 195530.7 211344.4 223346.3003 224671.55 229813.65 306081.5 1000
f_AnG_3 31657.3 35657.6 39882.8815 39256.85 42544.60 121925.6 1000
f_JCA 49.5 65.7 75.9027 67.40 69.50 4818.2 1000
So Jean-Claude Arbaut's C function still wins the gold medal.
Upvotes: 2
Reputation: 26218
I would probably have done it in a similar fashion to @27phi9. You can, however, do this without writing a function beforehand. Here are three approaches: (i) base R, (ii) dplyr only, (iii) dplyr + purrr.
df <- structure(list(a = c(0.5, 0.3, 1, 0.2, 0.4, 0.8), b = c(7L, 1L, 9L, 10L, 3L, 2L)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
transform(df, C = {x <- 0; Reduce(function(.x, .y){x <<- .x + x; (cumsum(b)[[.y]] + x) * a[[.y]]},
seq(nrow(df)),
init = 0,
accumulate = TRUE)[-1]})
#> a b C
#> 1 0.5 7 3.5000
#> 2 0.3 1 3.4500
#> 3 1.0 9 23.9500
#> 4 0.2 10 11.5800
#> 5 0.4 3 28.9920
#> 6 0.8 2 82.7776
library(dplyr)
df %>%
mutate(C = {x <- 0; Reduce(function(.x, .y){x <<- .x + x; (cumsum(b)[[.y]] + x) * a[[.y]]},
seq(nrow(df)),
init = 0,
accumulate = TRUE)[-1]})
#> # A tibble: 6 x 3
#> a b C
#> <dbl> <int> <dbl>
#> 1 0.5 7 3.5
#> 2 0.3 1 3.45
#> 3 1 9 24.0
#> 4 0.2 10 11.6
#> 5 0.4 3 29.0
#> 6 0.8 2 82.8
library(purrr)
df %>%
mutate(C = {x <- 0; unlist(accumulate2(cumsum(b), a, .init = 0, ~ {x <<- ..1 + x; (..2 + x) * ..3 }))[-1]})
#> # A tibble: 6 x 3
#> a b C
#> <dbl> <int> <dbl>
#> 1 0.5 7 3.5
#> 2 0.3 1 3.45
#> 3 1 9 24.0
#> 4 0.2 10 11.6
#> 5 0.4 3 29.0
#> 6 0.8 2 82.8
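The `x <<- ...` trick above keeps the running sum of C in a variable outside the reducer. As a variation, the same state can be carried inside the accumulator itself, which avoids the superassignment. This is a minimal base R sketch, with the columns of `df` rebuilt as plain vectors; the names `step` and `states` are illustrative and not part of the approaches above.

```r
a <- c(0.5, 0.3, 1, 0.2, 0.4, 0.8)
b <- c(7, 1, 9, 10, 3, 2)
cum_b <- cumsum(b)

# Each state holds the newest value of C and the running sum of all C so far.
step <- function(state, i) {
  value <- a[i] * (cum_b[i] + state$sum_c)
  list(value = value, sum_c = state$sum_c + value)
}

states <- Reduce(step, seq_along(a),
                 init = list(value = NA_real_, sum_c = 0),
                 accumulate = TRUE)

# Drop the initial state and pull out the per-row values.
C <- vapply(states[-1], function(s) s$value, numeric(1))
```

The result can then be attached with `transform(df, C = C)` or `mutate(df, C = C)`, just like in the three approaches above.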
Upvotes: 4
Reputation: 101337
A super efficient option is to solve a linear system (thanks to @Martin Gal for the comments):
transform(
df,
C = solve(
`diag<-`(mat <- matrix(-a, length(a), length(a)), 1) * lower.tri(mat, diag = TRUE),
a * cumsum(b)
)
)
which gives
a b C
1 0.5 7 3.5000
2 0.3 1 3.4500
3 1.0 9 23.9500
4 0.2 10 11.5800
5 0.4 3 28.9920
6 0.8 2 82.7776
or in a dplyr manner
df %>%
mutate(
C = solve(
`diag<-`(mat <- matrix(-a, length(a), length(a)), 1) * lower.tri(mat, diag = TRUE),
a * cumsum(b)
)
)
which gives
# A tibble: 6 x 3
a b C
<dbl> <int> <dbl>
1 0.5 7 3.5
2 0.3 1 3.45
3 1 9 24.0
4 0.2 10 11.6
5 0.4 3 29.0
6 0.8 2 82.8
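To see why a linear solve works here: rearranging the recurrence gives c_i - a_i * (c_1 + ... + c_(i-1)) = a_i * (b_1 + ... + b_i), which is a lower-triangular system M c = a * cumsum(b) with 1 on the diagonal and -a_i to the left of the diagonal in row i. The sketch below builds M step by step. Note that it uses base R's `forwardsolve()` for the triangular system instead of `solve()`; that substitution is my own assumption about a reasonable alternative, not part of the approach above, and the vectors are the columns of `df` written out by hand.

```r
a <- c(0.5, 0.3, 1, 0.2, 0.4, 0.8)
b <- c(7, 1, 9, 10, 3, 2)
n <- length(a)

# matrix() fills column by column, so every column equals -a and M[i, j] == -a[i].
M <- matrix(-a, n, n)
M[upper.tri(M)] <- 0  # c_i only depends on earlier rows
diag(M) <- 1          # coefficient of c_i itself

# Forward substitution on the lower-triangular system M %*% C == a * cumsum(b).
C <- forwardsolve(M, a * cumsum(b))
```

Either way, the dense n-by-n matrix needs memory quadratic in the number of rows, so this formulation is best suited to small inputs.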
A base R option (but an inefficient one) is to define a recursive function f:
f <- function(k) {
if (k == 1) {
return(with(df[k, ], a * b))
}
r <- f(k - 1)
c(r, with(df, a[k] * (sum(b[1:k]) + sum(r))))
}
and you will see
> f(nrow(df))
[1] 3.5000 3.4500 23.9500 11.5800 28.9920 82.7776
and
> df %>%
+ mutate(C = f(n()))
# A tibble: 6 x 3
a b C
<dbl> <int> <dbl>
1 0.5 7 3.5
2 0.3 1 3.45
3 1 9 24.0
4 0.2 10 11.6
5 0.4 3 29.0
6 0.8 2 82.8
Upvotes: 4
Reputation: 34441
One way you can achieve this is with a custom function that relies on purrr::accumulate2().
library(dplyr)
library(purrr)
f <- function(a, b) {
b <- cumsum(b)
x_cum <- 0
unlist(accumulate2(a, b, ~ {
x_cum <<- ..1 + x_cum
..2 * (..3 + x_cum)
}, .init = 0))[-1]
}
df %>%
mutate(c = f(a, b))
# A tibble: 6 x 3
a b c
<dbl> <int> <dbl>
1 0.5 7 3.5
2 0.3 1 3.45
3 1 9 24.0
4 0.2 10 11.6
5 0.4 3 29.0
6 0.8 2 82.8
Upvotes: 3
Reputation:
Sometimes, it's just simpler to do it in C++ (here via Rcpp).
library(Rcpp)
cppFunction("NumericVector custom_cumsum(double a, NumericVector b) {
int n = b.size();
NumericVector c(n);
double sb = 0, sc = 0;
for(int i = 0; i < n; i++) {
sb += b[i];
sc += (c[i] = a * (sb + sc));
}
return c;
}")
custom_cumsum(1.0, 1:10)
If a is also a vector:
cppFunction("NumericVector custom_cumsum(NumericVector a, NumericVector b) {
int n = b.size();
NumericVector c(n);
double sb = 0, sc = 0;
for(int i = 0; i < n; i++) {
sb += b[i];
sc += (c[i] = a[i] * (sb + sc));
}
return c;
}")
custom_cumsum(rep(1, 10), 1:10)
custom_cumsum(c(0.5, 0.3, 1, 0.2),
c(7, 1, 9, 10))
# result: 3.50 3.45 23.95 11.58
Upvotes: 2