C. Sebastian

Reputation: 157

Recursive sum over two variables using dplyr

I have two columns of values, a and b. I want to add a third column c which, at row i, is the sum of b up to and including row i plus the sum of c up to row i-1, multiplied by a at row i, i.e.

c_i = a_i * (sum_{j<=i} b_j + sum_{j<i} c_j)

I tried

data %>%
mutate(
 c = a * (cumsum(b) + lag(cumsum(c), default = 0))
)

However this doesn't work, as I am just creating c based on values of c that don't exist at the moment:

Error: Problem with `mutate()` input `c`.
x object 'c' not found

Previously I handled such problems with for-loops. However, I have gotten used to dplyr, where there is usually a way to do this, and I cannot find it here.

I am grateful for any help!

Edit: an earlier version of this question was inaccurate: a is also a vector, not a constant. I have corrected the formula accordingly.

The desired output:

row 1: 0.5 * (7 + 0) = 3.5

row 2: 0.3 * (7 + 1 + 3.5) = 3.45

row 3: 1.0 * (7 + 1 + 9 + 3.5 + 3.45) = 23.95

| a | b | c |
|---|---|---|
|0.5|7|3.5|
|0.3|1|3.45|
|1|9|23.95|
|0.2|10|...|

Upvotes: 7

Views: 941

Answers (5)

Martin Gal

Reputation: 16978

It's a little off-topic, but I benchmarked all the solutions shown here. I took the solutions as they appear in the answers; the only change was renaming 27 ϕ 9's function to g.

library(microbenchmark)

library(Rcpp)
library(dplyr)
library(purrr)

microbenchmark(
  f_TIC_1 = transform(
    df,
    c = solve(
      `diag<-`(mat <- matrix(-a, length(a), length(a)), 1) * lower.tri(mat, diag = TRUE),
      a * cumsum(b)
    )
  ),
  f_TIC_2 = df %>%
    mutate(
      c = solve(
        `diag<-`(mat <- matrix(-a, length(a), length(a)), 1) * lower.tri(mat, diag = TRUE),
        a * cumsum(b)
      )
    ),
  f_TIC_3 = df %>% mutate(c = f(n())),
  f_27p9  = df %>% mutate(c = g(a, b)),
  f_AnG_1 = transform(df, C = {x <- 0; Reduce(function(.x, .y){x <<- .x + x; (cumsum(b)[[.y]] + x) * a[[.y]]}, 
                                              seq(nrow(df)), 
                                              init = 0, 
                                              accumulate = TRUE)[-1]}),
  f_AnG_2 = df %>%
    mutate(C = {x <- 0; Reduce(function(.x, .y){x <<- .x + x; (cumsum(b)[[.y]] + x) * a[[.y]]}, 
                               seq(nrow(df)), 
                               init = 0, 
                               accumulate = TRUE)[-1]}),
  f_AnG_3 = df %>%
    mutate(C = {x <- 0; unlist(accumulate2(cumsum(b), a, .init = 0, ~ {x <<- ..1 + x; (..2 + x) * ..3 }))[-1]}),
  f_JCA  = df$c <- custom_cumsum(df$a, df$b),
  times = 10000
)

This gives us a clear winner:

Unit: microseconds
    expr    min     lq       mean median     uq     max neval
 f_TIC_1  169.8  194.2  229.09701  204.6  216.8 68505.6 10000
 f_TIC_2 1314.2 1372.3 1500.52929 1384.1 1415.0  9688.8 10000
 f_TIC_3 1360.6 1429.7 1557.12813 1442.8 1476.0  8486.1 10000
  f_27p9 1351.8 1402.4 1553.47802 1415.4 1446.9 68681.4 10000
 f_AnG_1  143.2  166.8  187.83451  174.5  184.6  4446.3 10000
 f_AnG_2 1337.8 1395.4 1532.25551 1407.1 1439.5 64981.6 10000
 f_AnG_3 1371.0 1428.1 1550.83456 1441.3 1474.0 10439.1 10000
   f_JCA   48.0   71.6   79.39209   76.3   82.3  5544.4 10000

@Jean-Claude Arbaut's custom Rcpp function ranks first in performance, followed by @AnilGoyal's transform-Reduce solution, with @ThomasIsCoding's transform-solve-matrix approach in third place. The tidyverse solutions are much slower in this benchmark: each has a median of about 1.4 ms, the same as @ThomasIsCoding's "Recursion approach, INEFFICIENT!!!!111".

Edit

Thanks to 27 ϕ 9's comment, I ran another example using a data.frame with 1000 rows:

set.seed(2^13)
n <- 1000
df <- data.frame(a = runif(n), 
                 b = sample(1:n, n, replace = TRUE))

The first interesting insight: the transform-solve-matrix solutions refuse to work and return an error:

Error in solve.default(diag<-(mat <- matrix(-a, length(a), length(a)), : system is computationally singular: reciprocal condition number

So I removed those solutions from the benchmark:

microbenchmark::microbenchmark(
  # f_TIC_1 = transform(
  #   df,
  #   c = solve(
  #     `diag<-`(mat <- matrix(-a, length(a), length(a)), 1) * lower.tri(mat, diag = TRUE),
  #     a * cumsum(b)
  #   )
  # ),
  # f_TIC_2 = df %>%
  #   mutate(
  #     c = solve(
  #       `diag<-`(mat <- matrix(-a, length(a), length(a)), 1) * lower.tri(mat, diag = TRUE),
  #       a * cumsum(b)
  #     )
  #   ),
  f_TIC_3 = df %>%  mutate(c = f(n())),
  f_27p9  = df %>% mutate(c = g(a, b)),
  f_AnG_1 = transform(df, C = {x <- 0; Reduce(function(.x, .y){x <<- .x + x; (cumsum(b)[[.y]] + x) * a[[.y]]}, 
                                              seq(nrow(df)), 
                                              init = 0, 
                                              accumulate = TRUE)[-1]}),
  f_AnG_2 = df %>%
    mutate(C = {x <- 0; Reduce(function(.x, .y){x <<- .x + x; (cumsum(b)[[.y]] + x) * a[[.y]]}, 
                               seq(nrow(df)), 
                               init = 0, 
                               accumulate = TRUE)[-1]}),
  f_AnG_3 = df %>%
    mutate(C = {x <- 0; unlist(accumulate2(cumsum(b), a, .init = 0, ~ {x <<- ..1 + x; (..2 + x) * ..3 }))[-1]}),
  f_JCA  = df$c <- custom_cumsum(df$a, df$b),
  times = 10000
)

This returns

Unit: microseconds
    expr    min      lq        mean  median      uq     max neval
 f_TIC_3 8880.6 9264.65 10369.56781 9401.40 9632.55 93771.1 10000
  f_27p9 3765.4 4050.80  4549.94921 4140.95 4259.25 80884.8 10000
 f_AnG_1 3494.1 3670.70  4050.70616 3724.50 3812.10 83652.8 10000
 f_AnG_2 5296.8 5603.70  6128.01699 5688.20 5803.85 82966.1 10000
 f_AnG_3 3966.5 4249.30  4732.57000 4340.15 4459.40 85970.4 10000
   f_JCA   11.3   23.00    25.84787   26.20   27.20  6148.9 10000

The next benchmark I did was using

set.seed(2^14)

n <- 10000
df <- data.frame(a = runif(n), 
                 b = sample(1:10, n, replace = TRUE))

This time f_TIC_3 refused to run, yielding

Error: C stack usage 15923184 is too close to the limit

Using a recursive function on large datasets doesn't seem to be an effective approach. For this run I also reduced the times argument from 10000 to 1000.

Unit: microseconds
    expr      min       lq        mean    median        uq      max neval
  f_27p9  29822.6  33721.1  38248.6520  37433.10  40815.25 120926.6  1000
 f_AnG_1 184626.9 198480.4 212004.0363 213608.10 218590.20 301492.9  1000
 f_AnG_2 195530.7 211344.4 223346.3003 224671.55 229813.65 306081.5  1000
 f_AnG_3  31657.3  35657.6  39882.8815  39256.85  42544.60 121925.6  1000
   f_JCA     49.5     65.7     75.9027     67.40     69.50   4818.2  1000

So Jean-Claude Arbaut's Rcpp function still wins the gold medal.

Upvotes: 2

AnilGoyal

Reputation: 26218

I would probably have done it in a similar fashion to @27phi9. You can, however, do this without writing a function beforehand. Here are three approaches: (i) base R, (ii) dplyr only, (iii) dplyr + purrr.

df <- structure(list(a = c(0.5, 0.3, 1, 0.2, 0.4, 0.8), b = c(7L, 1L,  9L, 10L, 3L, 2L)), row.names = c(NA, -6L), class = c("tbl_df",  "tbl", "data.frame"))

transform(df, C = {x <- 0; Reduce(function(.x, .y){x <<- .x + x; (cumsum(b)[[.y]] + x) * a[[.y]]}, 
                                  seq(nrow(df)), 
                                  init = 0, 
                                  accumulate = TRUE)[-1]})
#>     a  b       C
#> 1 0.5  7  3.5000
#> 2 0.3  1  3.4500
#> 3 1.0  9 23.9500
#> 4 0.2 10 11.5800
#> 5 0.4  3 28.9920
#> 6 0.8  2 82.7776

library(dplyr)

df %>%
  mutate(C = {x <- 0; Reduce(function(.x, .y){x <<- .x + x; (cumsum(b)[[.y]] + x) * a[[.y]]}, 
                             seq(nrow(df)), 
                             init = 0, 
                             accumulate = TRUE)[-1]})
#> # A tibble: 6 x 3
#>       a     b     C
#>   <dbl> <int> <dbl>
#> 1   0.5     7  3.5 
#> 2   0.3     1  3.45
#> 3   1       9 24.0 
#> 4   0.2    10 11.6 
#> 5   0.4     3 29.0 
#> 6   0.8     2 82.8

library(purrr)
df %>%
  mutate(C = {x <- 0; unlist(accumulate2(cumsum(b), a, .init = 0, ~ {x <<- ..1 + x; (..2 + x) * ..3 }))[-1]})
#> # A tibble: 6 x 3
#>       a     b     C
#>   <dbl> <int> <dbl>
#> 1   0.5     7  3.5 
#> 2   0.3     1  3.45
#> 3   1       9 24.0 
#> 4   0.2    10 11.6 
#> 5   0.4     3 29.0 
#> 6   0.8     2 82.8
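The three variants above all keep the running sum of C in an outer variable x that is updated with `<<-`. As a minimal sketch of an alternative, assuming the same recurrence, the running sum can be carried inside the Reduce() state instead. The function name recursive_sum_reduce and the state fields sum_c and value are made up for this illustration and are not part of the original answer.

```r
# Sketch: same recurrence as above, but the running sum of C travels in the
# Reduce() state instead of an outer variable modified with `<<-`.
# `recursive_sum_reduce`, `sum_c` and `value` are illustrative names.
recursive_sum_reduce <- function(a, b) {
  cum_b <- cumsum(b)
  steps <- Reduce(
    function(state, i) {
      # C[i] = a[i] * (cumulative b up to i + sum of all earlier C values)
      value <- a[i] * (cum_b[i] + state[["sum_c"]])
      c(sum_c = state[["sum_c"]] + value, value = value)
    },
    seq_along(a),
    init = c(sum_c = 0, value = NA_real_),
    accumulate = TRUE
  )
  # Drop the initial state and keep only the per-row values
  vapply(steps[-1], function(s) s[["value"]], numeric(1))
}

a <- c(0.5, 0.3, 1, 0.2, 0.4, 0.8)
b <- c(7, 1, 9, 10, 3, 2)
recursive_sum_reduce(a, b)
# 3.5, 3.45, 23.95, 11.58, 28.992, 82.7776 (the same values as column C above)
```

It can be used as df %>% mutate(C = recursive_sum_reduce(a, b)). It is not expected to be faster than the versions above; the point is only that no variable outside the reducer is modified.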

Upvotes: 4

ThomasIsCoding

Reputation: 101337

Update

A very efficient option is to solve a linear system (thanks to @Martin Gal for the comments):

transform(
  df,
  C = solve(
    `diag<-`(mat <- matrix(-a, length(a), length(a)), 1) * lower.tri(mat, diag = TRUE),
    a * cumsum(b)
  )
)

which gives

    a  b       C
1 0.5  7  3.5000
2 0.3  1  3.4500
3 1.0  9 23.9500
4 0.2 10 11.5800
5 0.4  3 28.9920
6 0.8  2 82.7776

or in a dplyr manner

df %>%
  mutate(
    C = solve(
      `diag<-`(mat <- matrix(-a, length(a), length(a)), 1) * lower.tri(mat, diag = TRUE),
      a * cumsum(b)
    )
  )

which gives

# A tibble: 6 x 3
      a     b     C
  <dbl> <int> <dbl>
1   0.5     7  3.5
2   0.3     1  3.45
3   1       9 24.0
4   0.2    10 11.6
5   0.4     3 29.0
6   0.8     2 82.8
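To spell out why this works: moving the C terms of the recurrence to the left-hand side gives C_i - a_i * (C_1 + ... + C_(i-1)) = a_i * cumsum(b)_i, so the coefficient matrix has 1 on the diagonal and -a_i everywhere below the diagonal in row i. Since that matrix is lower triangular, one possible variation is base R's forwardsolve(), which solves by forward substitution. This is only a sketch; solve_recursive_sum is a name invented here. As far as I know, forwardsolve() does not perform the reciprocal-condition check that makes solve() stop with "computationally singular", but I have not verified how it behaves on long inputs, and the n x n matrix still needs O(n^2) memory.

```r
# Sketch: build the unit lower-triangular system explicitly and solve it by
# forward substitution. `solve_recursive_sum` is an illustrative name.
solve_recursive_sum <- function(a, b) {
  n <- length(a)
  # matrix(-a, n, n) puts -a[i] in every column of row i;
  # multiplying by lower.tri() keeps only the entries strictly below the diagonal
  lower <- matrix(-a, n, n) * lower.tri(diag(n))
  diag(lower) <- 1
  # Right-hand side is a[i] * (b[1] + ... + b[i])
  drop(forwardsolve(lower, a * cumsum(b)))
}

a <- c(0.5, 0.3, 1, 0.2, 0.4, 0.8)
b <- c(7, 1, 9, 10, 3, 2)
solve_recursive_sum(a, b)
# 3.5, 3.45, 23.95, 11.58, 28.992, 82.7776 (matches the solve() output above)
```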

Previous answer (recursion approach, INEFFICIENT)

A base R option (but inefficient) is to define a recursive function f:

f <- function(k) {
  if (k == 1) {
    return(with(df[k, ], a * b))
  }
  r <- f(k - 1)
  c(r, with(df, a[k] * (sum(b[1:k]) + sum(r))))
}

and you will see

> f(nrow(df))
[1]  3.5000  3.4500 23.9500 11.5800 28.9920 82.7776

and

> df %>%
+   mutate(C = f(n()))
# A tibble: 6 x 3
      a     b     C
  <dbl> <int> <dbl>
1   0.5     7  3.5
2   0.3     1  3.45
3   1       9 24.0
4   0.2    10 11.6
5   0.4     3 29.0
6   0.8     2 82.8
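The recursion in f calls itself once per row, so its call depth grows with the number of rows. A plain loop computes the same recurrence with two running totals and no nested calls. The following is a minimal base R sketch that mirrors the recurrence from the question; recursive_sum_loop is a name chosen for this illustration.

```r
# Sketch: iterative version of the recurrence
#   C[i] = a[i] * (sum(b[1:i]) + sum(C[1:(i-1)]))
# using two running totals. `recursive_sum_loop` is an illustrative name.
recursive_sum_loop <- function(a, b) {
  stopifnot(length(a) == length(b))
  out <- numeric(length(a))
  sum_b <- 0  # running sum of b up to and including row i
  sum_c <- 0  # running sum of C over the rows before row i
  for (i in seq_along(a)) {
    sum_b <- sum_b + b[i]
    out[i] <- a[i] * (sum_b + sum_c)
    sum_c <- sum_c + out[i]
  }
  out
}

a <- c(0.5, 0.3, 1, 0.2, 0.4, 0.8)
b <- c(7, 1, 9, 10, 3, 2)
recursive_sum_loop(a, b)
# 3.5, 3.45, 23.95, 11.58, 28.992, 82.7776 (the same values as f(nrow(df)) above)
```

Inside dplyr it can be called as df %>% mutate(C = recursive_sum_loop(a, b)).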

Upvotes: 4

lroha

Reputation: 34441

One way you can achieve this is with a custom function that relies on purrr::accumulate2().

library(dplyr)
library(purrr)

f <- function(a, b) {
  b <- cumsum(b)
  x_cum <- 0
  unlist(accumulate2(a, b, ~ {
    x_cum <<- ..1 + x_cum
    ..2 * (..3 + x_cum)
  }, .init = 0))[-1]
}



df %>%
  mutate(c = f(a, b))

# A tibble: 6 x 3
      a     b     c
  <dbl> <int> <dbl>
1   0.5     7  3.5 
2   0.3     1  3.45
3   1       9 24.0 
4   0.2    10 11.6 
5   0.4     3 29.0 
6   0.8     2 82.8 

Upvotes: 3

user13963867

Reputation:

Sometimes, it's just simpler to do it in C++ (via Rcpp).

library(Rcpp)

cppFunction("NumericVector custom_cumsum(double a, NumericVector b) {
  int n = b.size();
  NumericVector c(n);
  double sb = 0, sc = 0;

  for(int i = 0; i < n; i++) {
    sb += b[i];
    sc += (c[i] = a * (sb + sc));
  }
  return c;
}")

custom_cumsum(1.0, 1:10)

If a is also a vector:

cppFunction("NumericVector custom_cumsum(NumericVector a, NumericVector b) {
  int n = b.size();
  NumericVector c(n);
  double sb = 0, sc = 0;

  for(int i = 0; i < n; i++) {
    sb += b[i];
    sc += (c[i] = a[i] * (sb + sc));
  }
  return c;
}")

custom_cumsum(rep(1, 10), 1:10)

custom_cumsum(c(0.5, 0.3, 1, 0.2),
              c(7, 1, 9, 10))
# result: 3.50  3.45 23.95 11.58

Upvotes: 2
