Reputation: 122
I'm trying to calculating a mean over an expanding window, however the data structure makes it so that previous answers are missing at least a little bit of what's needed (closest is here: link).
My data looks something like:
Company TimePeriod IndividualID Date.Indiv.Acted Value
1 1 2015 A 2015-01-01 400
2 1 2015 B 2015-02-01 200
3 1 2015 A 2015-06-15 400
4 1 2015 C 2015-07-12 300
5 1 2016 A 2016-07-15 400
6 1 2016 B 2016-08-09 100
7 1 2016 C 2016-09-10 400
8 1 2016 A 2016-10-11 100
9 2 2004 A 2004-07-12 200
10 2 2004 B 2004-08-12 300
And I need to take the cumulative mean of the Value, by Company-TimePeriod, for each Date.Indiv.Acted. However, I need to remove duplicates as I go along keeping the most recent one. So for the first two averages, there's no problem - they would include row 1, and row 1 and row 2. However, row 1, 2, and 3 should remove row 1 because IndividualID is a duplicate. Essentially, I have forecast information and want to use only the most recent forecast by an individual in each calculation of the mean.
So my final data would look something like (rows added for ease of interpretation - don't need that in the data),
Company TimePeriod IndividualID Date.Indiv.Acted Value CumMean
1 1 2015 A 2015-01-01 400 400
2 1 2015 B 2015-02-01 200 300 (row 1 and 2)
3 1 2015 A 2015-06-15 400 300 (row 2 and 3)
4 1 2015 C 2015-07-12 300 300 (2,3,4)
5 1 2016 A 2016-07-15 400 400 (5)
6 1 2016 B 2016-08-09 100 250 (5,6)
7 1 2016 C 2016-09-10 400 300 (5,6,7)
8 1 2016 A 2016-10-11 100 200 (6,7,8)
9 2 2004 A 2004-07-12 200 200 (9)
10 2 2004 B 2004-08-12 300 250 (9,10)
A data.table solution would be ideal, but I'm not picky as long as it can run on fairly large data (20M rows or so) and not take until the heat death of the universe.
Any help you guys have would be greatly appreciated.
Upvotes: 3
Views: 635
Reputation: 2652
I particularly do not like loops, but I think this one got simple enough to understand the step by step. It can be easily altered to run any other metric instead of mean (e.g. cumulative variance)
# function that drops duplicates and calculates cumulative mean
fun.attempt <- function(dat, dup, value){
#dat: data set
#dup: string column to look for duplicates
#value: string column to calculate the mean
x <- dat[!duplicated(get(dup), fromLast = T), .(get(value))]
y <- cumsum(x) / 1:nrow(x)
y <- y[nrow(y)]
return(y)
}
foo[, grp := .GRP, by = .(Company, TimePeriod)] # to create a more efficient loop
hl <- list() # as storage
for(k in unique(foo$grp)){
got <- foo[grp == k] # running the cumulative mean for each grouping
for(y in 1:nrow(got)){
# applying customized function
got[y, cummean2:= fun.attempt(got[1:y], 'IndividualID', 'Value')]
}
hl[[k]] <- got # storing the subsetted data.tables
}
Now just compile the lists of data.tables. CumMean
column is your original calculation, cummean2
is mine.
rbindlist(hl)
Company TimePeriod IndividualID Date.Indiv.Acted Value CumMean grp cummean2
1: 1 2015 A 2015-01-01 400 400 1 400
2: 1 2015 B 2015-02-01 200 300 1 300
3: 1 2015 A 2015-06-15 400 300 1 300
4: 1 2015 C 2015-07-12 300 300 1 300
5: 1 2016 A 2016-07-15 400 400 2 400
6: 1 2016 B 2016-08-09 100 250 2 250
7: 1 2016 C 2016-09-10 400 300 2 300
8: 1 2016 A 2016-10-11 100 200 2 200
9: 2 2004 A 2004-07-12 200 200 3 200
10: 2 2004 B 2004-08-12 300 250 3 250
Upvotes: 1
Reputation: 7373
setDT(dt)
dt[, occ := 1:.N, by = .(Company, TimePeriod, IndividualID)]
dt[, n := cumsum(!duplicated(IndividualID)), by = .(Company, TimePeriod)]
dt[, Value1 := Value,]
dt[, x := c(0, diff(Value)), by = .(Company, TimePeriod, IndividualID)]
dt[occ>1, Value1 := x,]
dt[, Cummean := cumsum(Value1)/n, by = .(Company, TimePeriod)]
dt[, c("occ", "n", "Value1", "x") := NULL][]
# Company TimePeriod IndividualID Date.Indiv.Acted Value Cummean
# 1: 1 2015 A 2015-01-01 400 400
# 2: 1 2015 B 2015-02-01 200 300
# 3: 1 2015 A 2015-06-15 400 300
# 4: 1 2015 C 2015-07-12 300 300
# 5: 1 2016 A 2016-07-15 400 400
# 6: 1 2016 B 2016-08-09 100 250
# 7: 1 2016 C 2016-09-10 400 300
# 8: 1 2016 A 2016-10-11 100 200
# 9: 2 2004 A 2004-07-12 200 200
#10: 2 2004 B 2004-08-12 300 250
dt <- structure(list(Company = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2), TimePeriod = c(2015,
2015, 2015, 2015, 2016, 2016, 2016, 2016, 2004, 2004), IndividualID = c("A",
"B", "A", "C", "A", "B", "C", "A", "A", "B"), Date.Indiv.Acted = c("2015-01-01",
"2015-02-01", "2015-06-15", "2015-07-12", "2016-07-15", "2016-08-09",
"2016-09-10", "2016-10-11", "2004-07-12", "2004-08-12"), Value = c(400,
200, 400, 300, 400, 100, 400, 100, 200, 300)), row.names = c(NA,
-10L), class = "data.frame")
Upvotes: 4