Cumulative (expanding window) mean by group with duplicate check for each calculation

Question

I'm trying to calculating a mean over an expanding window, however the data structure makes it so that previous answers are missing at least a little bit of what's needed (closest is here: link).

My data looks something like:

  Company TimePeriod IndividualID Date.Indiv.Acted  Value 
1  1         2015          A           2015-01-01    400
2  1         2015          B           2015-02-01    200
3  1         2015          A           2015-06-15    400
4  1         2015          C           2015-07-12    300
5  1         2016          A           2016-07-15    400
6  1         2016          B           2016-08-09    100
7  1         2016          C           2016-09-10    400
8  1         2016          A           2016-10-11    100
9  2         2004          A           2004-07-12    200
10 2         2004          B           2004-08-12    300

And I need to take the cumulative mean of the Value, by Company-TimePeriod, for each Date.Indiv.Acted. However, I need to remove duplicates as I go along keeping the most recent one. So for the first two averages, there's no problem - they would include row 1, and row 1 and row 2. However, row 1, 2, and 3 should remove row 1 because IndividualID is a duplicate. Essentially, I have forecast information and want to use only the most recent forecast by an individual in each calculation of the mean.

So my final data would look something like (rows added for ease of interpretation - don't need that in the data),

  Company TimePeriod IndividualID Date.Indiv.Acted  Value CumMean 
1  1         2015          A           2015-01-01    400   400
2  1         2015          B           2015-02-01    200   300 (row 1 and 2)
3  1         2015          A           2015-06-15    400   300 (row 2 and 3)
4  1         2015          C           2015-07-12    300   300 (2,3,4)
5  1         2016          A           2016-07-15    400   400 (5)
6  1         2016          B           2016-08-09    100   250 (5,6)
7  1         2016          C           2016-09-10    400   300 (5,6,7)
8  1         2016          A           2016-10-11    100   200 (6,7,8)
9  2         2004          A           2004-07-12    200   200 (9)
10 2         2004          B           2004-08-12    300   250 (9,10)

A data.table solution would be ideal, but I'm not picky as long as it can run on fairly large data (20M rows or so) and not take until the heat death of the universe.

Any help you guys have would be greatly appreciated.

Khashaa · Accepted Answer

setDT(dt)
dt[, occ := 1:.N, by = .(Company, TimePeriod, IndividualID)]
dt[, n := cumsum(!duplicated(IndividualID)), by = .(Company, TimePeriod)]
dt[, Value1 := Value,]
dt[, x := c(0, diff(Value)), by = .(Company, TimePeriod, IndividualID)]
dt[occ>1, Value1 := x,]
dt[, Cummean := cumsum(Value1)/n, by = .(Company, TimePeriod)]
dt[, c("occ", "n", "Value1", "x") := NULL][]
#    Company TimePeriod IndividualID Date.Indiv.Acted Value Cummean
# 1:       1       2015            A       2015-01-01   400     400
# 2:       1       2015            B       2015-02-01   200     300
# 3:       1       2015            A       2015-06-15   400     300
# 4:       1       2015            C       2015-07-12   300     300
# 5:       1       2016            A       2016-07-15   400     400
# 6:       1       2016            B       2016-08-09   100     250
# 7:       1       2016            C       2016-09-10   400     300
# 8:       1       2016            A       2016-10-11   100     200
# 9:       2       2004            A       2004-07-12   200     200
#10:       2       2004            B       2004-08-12   300     250

dt <- structure(list(Company = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2), TimePeriod = c(2015, 
2015, 2015, 2015, 2016, 2016, 2016, 2016, 2004, 2004), IndividualID = c("A", 
"B", "A", "C", "A", "B", "C", "A", "A", "B"), Date.Indiv.Acted = c("2015-01-01", 
"2015-02-01", "2015-06-15", "2015-07-12", "2016-07-15", "2016-08-09", 
"2016-09-10", "2016-10-11", "2004-07-12", "2004-08-12"), Value = c(400, 
200, 400, 300, 400, 100, 400, 100, 200, 300)), row.names = c(NA, 
-10L), class = "data.frame")

Cumulative (expanding window) mean by group with duplicate check for each calculation

Answers (2)

Related Questions