J M
J M

Reputation: 122

Cumulative (expanding window) mean by group with duplicate check for each calculation

I'm trying to calculating a mean over an expanding window, however the data structure makes it so that previous answers are missing at least a little bit of what's needed (closest is here: link).

My data looks something like:

  Company TimePeriod IndividualID Date.Indiv.Acted  Value 
1  1         2015          A           2015-01-01    400
2  1         2015          B           2015-02-01    200
3  1         2015          A           2015-06-15    400
4  1         2015          C           2015-07-12    300
5  1         2016          A           2016-07-15    400
6  1         2016          B           2016-08-09    100
7  1         2016          C           2016-09-10    400
8  1         2016          A           2016-10-11    100
9  2         2004          A           2004-07-12    200
10 2         2004          B           2004-08-12    300

And I need to take the cumulative mean of the Value, by Company-TimePeriod, for each Date.Indiv.Acted. However, I need to remove duplicates as I go along keeping the most recent one. So for the first two averages, there's no problem - they would include row 1, and row 1 and row 2. However, row 1, 2, and 3 should remove row 1 because IndividualID is a duplicate. Essentially, I have forecast information and want to use only the most recent forecast by an individual in each calculation of the mean.

So my final data would look something like (rows added for ease of interpretation - don't need that in the data),

  Company TimePeriod IndividualID Date.Indiv.Acted  Value CumMean 
1  1         2015          A           2015-01-01    400   400
2  1         2015          B           2015-02-01    200   300 (row 1 and 2)
3  1         2015          A           2015-06-15    400   300 (row 2 and 3)
4  1         2015          C           2015-07-12    300   300 (2,3,4)
5  1         2016          A           2016-07-15    400   400 (5)
6  1         2016          B           2016-08-09    100   250 (5,6)
7  1         2016          C           2016-09-10    400   300 (5,6,7)
8  1         2016          A           2016-10-11    100   200 (6,7,8)
9  2         2004          A           2004-07-12    200   200 (9)
10 2         2004          B           2004-08-12    300   250 (9,10)

A data.table solution would be ideal, but I'm not picky as long as it can run on fairly large data (20M rows or so) and not take until the heat death of the universe.

Any help you guys have would be greatly appreciated.

Upvotes: 3

Views: 635

Answers (2)

Felipe Alvarenga
Felipe Alvarenga

Reputation: 2652

I particularly do not like loops, but I think this one got simple enough to understand the step by step. It can be easily altered to run any other metric instead of mean (e.g. cumulative variance)

# function that drops duplicates and calculates cumulative mean
fun.attempt <- function(dat, dup, value){
  #dat: data set
  #dup: string column to look for duplicates
  #value: string column to calculate the mean

  x <- dat[!duplicated(get(dup), fromLast = T), .(get(value))]

  y <- cumsum(x) / 1:nrow(x)

  y <- y[nrow(y)]
  return(y)
}

foo[, grp := .GRP, by = .(Company, TimePeriod)] # to create a more efficient loop
hl <- list() # as storage

for(k in unique(foo$grp)){

    got <- foo[grp == k] # running the cumulative mean for each grouping

    for(y in 1:nrow(got)){
      # applying customized function
      got[y, cummean2:= fun.attempt(got[1:y], 'IndividualID', 'Value')]

    }

    hl[[k]] <- got # storing the subsetted data.tables

}

Now just compile the lists of data.tables. CumMean column is your original calculation, cummean2 is mine.

rbindlist(hl)
    Company TimePeriod IndividualID Date.Indiv.Acted Value CumMean grp cummean2
 1:       1       2015            A       2015-01-01   400     400   1      400
 2:       1       2015            B       2015-02-01   200     300   1      300
 3:       1       2015            A       2015-06-15   400     300   1      300
 4:       1       2015            C       2015-07-12   300     300   1      300
 5:       1       2016            A       2016-07-15   400     400   2      400
 6:       1       2016            B       2016-08-09   100     250   2      250
 7:       1       2016            C       2016-09-10   400     300   2      300
 8:       1       2016            A       2016-10-11   100     200   2      200
 9:       2       2004            A       2004-07-12   200     200   3      200
10:       2       2004            B       2004-08-12   300     250   3      250

Upvotes: 1

Khashaa
Khashaa

Reputation: 7373

setDT(dt)
dt[, occ := 1:.N, by = .(Company, TimePeriod, IndividualID)]
dt[, n := cumsum(!duplicated(IndividualID)), by = .(Company, TimePeriod)]
dt[, Value1 := Value,]
dt[, x := c(0, diff(Value)), by = .(Company, TimePeriod, IndividualID)]
dt[occ>1, Value1 := x,]
dt[, Cummean := cumsum(Value1)/n, by = .(Company, TimePeriod)]
dt[, c("occ", "n", "Value1", "x") := NULL][]
#    Company TimePeriod IndividualID Date.Indiv.Acted Value Cummean
# 1:       1       2015            A       2015-01-01   400     400
# 2:       1       2015            B       2015-02-01   200     300
# 3:       1       2015            A       2015-06-15   400     300
# 4:       1       2015            C       2015-07-12   300     300
# 5:       1       2016            A       2016-07-15   400     400
# 6:       1       2016            B       2016-08-09   100     250
# 7:       1       2016            C       2016-09-10   400     300
# 8:       1       2016            A       2016-10-11   100     200
# 9:       2       2004            A       2004-07-12   200     200
#10:       2       2004            B       2004-08-12   300     250

dt <- structure(list(Company = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2), TimePeriod = c(2015, 
2015, 2015, 2015, 2016, 2016, 2016, 2016, 2004, 2004), IndividualID = c("A", 
"B", "A", "C", "A", "B", "C", "A", "A", "B"), Date.Indiv.Acted = c("2015-01-01", 
"2015-02-01", "2015-06-15", "2015-07-12", "2016-07-15", "2016-08-09", 
"2016-09-10", "2016-10-11", "2004-07-12", "2004-08-12"), Value = c(400, 
200, 400, 300, 400, 100, 400, 100, 200, 300)), row.names = c(NA, 
-10L), class = "data.frame")

Upvotes: 4

Related Questions