Mike
Mike

Reputation: 1069

R: Cumulative sum with missing dates

I have two dataframes, one (dataframe 1) with dates and a number of observations on each date. In the other table (dataframe 2) I have more dates than I have in the other dataframe.

What I would like to do is calculate the cumulative sum over dataframe 1 and place the result in dataframe 2. Any date that exists in dataframe 2, but not in dataframe 1 should simply have the value from the previous day

Dataframe 1

Date        Obs  
2015-01-10  2  
2015-01-11  3  
2015-01-16  1  
2015-01-20  4  

Dataframe 2

Date       cumSum
2015-01-09  0
2015-01-10  2
2015-01-11  5
2015-01-12  5
2015-01-13  5
2015-01-14  5
2015-01-15  5
2015-01-16  6
2015-01-17  6
2015-01-18  6
2015-01-19  6
2015-01-20 10
2015-01-21 10

Please let me know if anything is unclear. Any help would be great!

Thanks,

Mike

Upvotes: 0

Views: 1506

Answers (1)

akrun
akrun

Reputation: 887851

We could use the devel version of data.table i.e. v1.9.5 (Instructions to install the devel version are here.

We convert the first 'data.frame' ('df1') to 'data.table' (setDT(df1)), join with the 'Date' column of 'df2' using the on option (available in the devel version). We create the 'Cumsum' column based on the non-NA elements in 'Obs' by doing the cumulative sum of the 'Obs' specified by the row index in i (!is.na(Obs)). Then, we can use the na.locf from library(zoo) to replace the NA values with the non-NA previous values and update the 'Cumsum' column. As the 'Obs' column is not in the expected output, we can assign (:=) it to NULL.

library(data.table)#v1.9.5+
library(zoo)
res <- setDT(df1)[df2['Date'], on='Date'][!is.na(Obs), Cumsum:=cumsum(Obs)
           ][, Cumsum:=na.locf(Cumsum, na.rm=FALSE)][, Obs := NULL]
res
#         Date Cumsum
# 1: 2015-01-09     NA
# 2: 2015-01-10      2
# 3: 2015-01-11      5
# 4: 2015-01-12      5
# 5: 2015-01-13      5
# 6: 2015-01-14      5
# 7: 2015-01-15      5
# 8: 2015-01-16      6
# 9: 2015-01-17      6
#10: 2015-01-18      6
#11: 2015-01-19      6
#12: 2015-01-20     10
#13: 2015-01-21     10

If needed, we can replace the NA value in 'Cumsum' with '0'

res[is.na(Cumsum), Cumsum:=0]

Or as @Khashaa mentioned in the comments we can do this without na.locf by using roll=Inf

 setDT(df1)[,cumSum:=cumsum(Obs),][df2['Date'],
                       on='Date',roll=Inf][, Obs:= NULL][]

Or another option is match with na.locf to get the numeric index and replace the non-NA index (from match) with the cumulative sum of 'Obs', use na.locf as before and if needed, we can replace the NA with 0.

df2$Cumsum <- na.locf(cumsum(df1$Obs)[match(df2$Date, df1$Date)], na.rm=FALSE)

data

df1 <- structure(list(Date = structure(c(16445, 16446, 16451, 16455),
class = "Date"), 
Obs = c(2L, 3L, 1L, 4L)), .Names = c("Date", "Obs"), 
row.names = c(NA, -4L), class = "data.frame")

df2 <-  structure(list(Date = structure(c(16444, 16445, 16446, 16447, 
16448, 16449, 16450, 16451, 16452, 16453, 16454, 16455, 16456
), class = "Date"), cumSum = c(0L, 2L, 5L, 5L, 5L, 5L, 5L, 6L, 
6L, 6L, 6L, 10L, 10L)), .Names = c("Date", "cumSum"), row.names = c(NA, 
-13L), class = "data.frame")

Upvotes: 2

Related Questions