kimwonjeong
kimwonjeong

Reputation: 51

For a period of time cumulative sum in R

  id   date     goal      date_followup_3month  cumulative_sum
  1 2004-12-31    1           2005-03-31              3
  1 2005-01-21    2           2005-04-21              6
  1 2005-04-04    3           2005-07-03              4
  1 2005-04-04    1           2005-07-03              1
  2 2001-01-05    4           2001-04-05              4
  2 2002-02-05    3           2002-05-06              5

cumulative_sum column is the cumulative value of the goal from date to three months for each ID.

The code I thought was as below and the result is not coming out.

error code : argument "yes" is missing, with no default

for(i in 1:length(id)){

    cumulative_sum[i] <-  for(j in 1:length(id))
    {max(cumsum(ifelse(id[i] == id[j]
                       & date[j] >= date[i]
                       & date[j] <= date_followup_3month[i])
                ,goal[j],0))}

}

I want to know if there is any other good code. Thank you very much.

Upvotes: 3

Views: 1081

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 270298

Here are several possibilities. The SQL based solution (1) seems highly readable and is potentially space efficient since SQL may optimize the join. The data.table solution (2) creates a large intermediate result but see the comment by @Frank to avoid that. Even larger intermediate results are created in (3) and (4) which may not be feasible if the data is sufficiently large. The loop based solution (5) is space efficient but uses a style not normally used in R. (5) can be translated to C++ (using Rcpp) in a straight forard way which we show in (6).

1) sqldf This can be expressed in SQL using a self join on a complex logical condition:

library(sqldf)
sqldf("select a.*, sum(b.goal) cumulative_sum
  from DF a
  join DF b on a.id = b.id and b.rowid >= a.rowid and b.date <= a.date_followup_3month
  group by a.rowid")

giving:

  id       date goal date_followup_3month cumulative_sum
1  1 2004-12-31    1           2005-03-31              3
2  1 2005-01-21    2           2005-04-21              6
3  1 2005-04-04    3           2005-07-03              4
4  1 2005-04-04    1           2005-07-03              1
5  2 2001-01-05    4           2001-04-05              4
6  2 2002-02-05    3           2002-05-06              3

2) data.table This can be done in data.table as well although note that this involves creating an intermediate object with a large number of rows whereas sql may optimize that away.

library(data.table)

DT <- as.data.table(DF)
DT[, seq:=.I][
  DT, on = .(id == id, seq <= seq, date_followup_3month >= date)][
  , list(id = id[1], 
         date = date[1], 
         date_followup_3month = date_followup_3month[1],
         cumulative_sum = sum(i.goal)), by = seq]

3) Base R Here is a base solution which performs a self join explicitly over id only and then subsets the rows down for the other terms in the condition. Finally it uses tapply to perform the summation. It involves explicitly generating s which is an even larger intermediate result.

DF0 <- cbind(seq = 1:nrow(DF), DF)
s <- subset(merge(DF0, DF0, by = "id"), 
       seq.x <= seq.y & date_followup_3month.x >= date.y)
transform(DF, cumulative_sum = tapply(s$goal.y, s$seq.x, sum))

4) dplyr This uses dplyr and like (3) involves a potentially very large intermediate result since it performs a self join with respect only to id.

library(dplyr)
DF %>% 
  mutate(seq = 1:n()) %>% 
  inner_join(., ., by = "id", suffix = c("", ".x")) %>%
  filter(seq.x >= seq & date.x <= date_followup_3month) %>%
  group_by(seq, date, goal, date_followup_3month) %>%
  summarize(cumulative_sum = sum(goal.x)) %>%
  ungroup %>%
  select(-seq)

5) Loop - Base R Explicit looping is discouraged in R and can be slow but on the other hand it is relatively straight forward and space efficient. This could be used as a model to translate the code to C++ which we do in the solution following this one. Note that we included several optimizations. Because the input is sorted the j loop can start from i, rather than from 1, and as soon as the condition in the j loop fails we can immediately exit the j loop as the satisfying rows necessarily all occur together.

n <- nrow(DF)
Sum <- numeric(n)
for(i in 1:n) {
  for(j in i:n) {
    if (with(DF, id[i] == id[j] && date[j] <= date_followup_3month[i])) {
      Sum[i] <- Sum[i] + DF$goal[j]
    } else break
  }
}
transform(DF, cumulative_sum = Sum)

6) Rcpp We can translate (5) to C++. Assume we have a file called cum_sum.cpp containing this:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector cum_sum(NumericVector id, IntegerVector date, 
  IntegerVector date_followup_3month, NumericVector goal) {
  auto n = id.size();
  NumericVector Sum(n);
  for(auto i = 0; i < n; i++) {
    Sum[i] = 0.0;
    for(auto j = i; j < n; j++) {
      if (id[i] == id[j] && date[j] <= date_followup_3month[i]) {
        Sum[i] = Sum[i] + goal[j];
      } else break;
    }
  }
  return Sum;
}

Then run:

library(Rcpp)
sourceCpp("cum_sum.cpp")
transform(DF, cumulative_sum = 
                cum_sum(id, date, date_followup_3month, as.numeric(goal)))

Note

The input DF in reproducible form is:

Lines <- "id   date     goal      date_followup_3month
  1 2004-12-31    1           2005-03-31
  1 2005-01-21    2           2005-04-21
  1 2005-04-04    3           2005-07-03
  1 2005-04-04    1           2005-07-03
  2 2001-01-05    4           2001-04-05
  2 2002-02-05    3           2002-05-06"
DF <- read.table(text = Lines, header = TRUE)
DF$date <- as.Date(DF$date)
DF$date_followup_3month <- as.Date(DF$date_followup_3month)

Upvotes: 4

Nar
Nar

Reputation: 658

You can use just sum for the lines which satisfy condition on dates and id instead of max(cumsum). Also to avoid nested loops, function could be use. Example with some simplification is below:

      goalsum <- function(date, i){
      start <- date$date[i]
      end <- date$date_followup_3month[i]
      ind <- date$id[i]
      tot_goal <- date%>%
        filter(date>=start & date<=end & id==ind)%>%
        summarise(sum(goal))
      return(tot_goal[1,1])
    }

    for(i in 1:length(date)){date$res[i] <-goalsum(date, i)}

Upvotes: 1

Related Questions