Reputation: 1069
I have two dataframes, one (dataframe 1) with dates and a number of observations on each date. In the other table (dataframe 2) I have more dates than I have in the other dataframe.
What I would like to do is calculate the cumulative sum over dataframe 1 and place the result in dataframe 2. Any date that exists in dataframe 2, but not in dataframe 1 should simply have the value from the previous day
Dataframe 1
Date Obs
2015-01-10 2
2015-01-11 3
2015-01-16 1
2015-01-20 4
Dataframe 2
Date cumSum
2015-01-09 0
2015-01-10 2
2015-01-11 5
2015-01-12 5
2015-01-13 5
2015-01-14 5
2015-01-15 5
2015-01-16 6
2015-01-17 6
2015-01-18 6
2015-01-19 6
2015-01-20 10
2015-01-21 10
Please let me know if anything is unclear. Any help would be great!
Thanks,
Mike
Upvotes: 0
Views: 1506
Reputation: 887851
We could use the devel version of data.table
i.e. v1.9.5
(Instructions to install the devel version are here
.
We convert the first 'data.frame' ('df1') to 'data.table' (setDT(df1)
), join with the 'Date' column of 'df2' using the on
option (available in the devel version). We create the 'Cumsum' column based on the non-NA elements in 'Obs' by doing the cumulative sum of the 'Obs' specified by the row index in i (!is.na(Obs)
). Then, we can use the na.locf
from library(zoo)
to replace the NA
values with the non-NA previous values and update the 'Cumsum' column. As the 'Obs' column is not in the expected output, we can assign (:=
) it to NULL.
library(data.table)#v1.9.5+
library(zoo)
res <- setDT(df1)[df2['Date'], on='Date'][!is.na(Obs), Cumsum:=cumsum(Obs)
][, Cumsum:=na.locf(Cumsum, na.rm=FALSE)][, Obs := NULL]
res
# Date Cumsum
# 1: 2015-01-09 NA
# 2: 2015-01-10 2
# 3: 2015-01-11 5
# 4: 2015-01-12 5
# 5: 2015-01-13 5
# 6: 2015-01-14 5
# 7: 2015-01-15 5
# 8: 2015-01-16 6
# 9: 2015-01-17 6
#10: 2015-01-18 6
#11: 2015-01-19 6
#12: 2015-01-20 10
#13: 2015-01-21 10
If needed, we can replace the NA
value in 'Cumsum' with '0'
res[is.na(Cumsum), Cumsum:=0]
Or as @Khashaa mentioned in the comments we can do this without na.locf
by using roll=Inf
setDT(df1)[,cumSum:=cumsum(Obs),][df2['Date'],
on='Date',roll=Inf][, Obs:= NULL][]
Or another option is match
with na.locf
to get the numeric index and replace the non-NA index (from match
) with the cumulative sum of 'Obs', use na.locf
as before and if needed, we can replace the NA
with 0.
df2$Cumsum <- na.locf(cumsum(df1$Obs)[match(df2$Date, df1$Date)], na.rm=FALSE)
df1 <- structure(list(Date = structure(c(16445, 16446, 16451, 16455),
class = "Date"),
Obs = c(2L, 3L, 1L, 4L)), .Names = c("Date", "Obs"),
row.names = c(NA, -4L), class = "data.frame")
df2 <- structure(list(Date = structure(c(16444, 16445, 16446, 16447,
16448, 16449, 16450, 16451, 16452, 16453, 16454, 16455, 16456
), class = "Date"), cumSum = c(0L, 2L, 5L, 5L, 5L, 5L, 5L, 6L,
6L, 6L, 6L, 10L, 10L)), .Names = c("Date", "cumSum"), row.names = c(NA,
-13L), class = "data.frame")
Upvotes: 2