Henky
Henky

Reputation: 69

Conditional sum in rolling window

I am quite new to R, so apologies in advance if I state something wrong :)

I have a dataframe consisting of 395 rows and 4973 columns, sorted by months, with number of occurrence per month (ranges from 0 to eg 25) for a lot of companies. The number of occurrence was summarised from daily data grouped by month and year. My dataframe df looks something like that (only a few months and 3 companies):

Date     FirmA FirmB FirmC
01-2015  20    NA    20
02-2015  21    2     1
03-2015  22    3     2
04-2015  24    7     5
05-2015  10    10    10
06-2015  9     20    2
07-2015  13    22    1
08-2015  20    19    1

I have now the task to sum up the occurences per company by a three month rolling window from months t-3 to t-1 (the 3 previous months). However, the sum should have following conditions. It should have at least 10 occurrences during the three month window and at least 3 occurrences in month t-1. It doesn't matter if an NA is in t-3 and/or t-2, as long as the two conditions are met.

It should look like that.

Date     FirmA FirmB FirmC
01-2015  NA    NA    NA
02-2015  20    NA    20
03-2015  41    NA    NA
04-2015  63    NA    NA
05-2015  67    12    NA
06-2015  56    20    17
07-2015  43    37    NA
08-2015  32    52    NA

I have no clue, how to approach that, especially the combination of rolling window/sum (probably something with lag) and the conditions regarding which numbers to use and which not.

Upvotes: 2

Views: 785

Answers (2)

alexis_laz
alexis_laz

Reputation: 13132

Another approach, similar in concept to r2evans', is to compute the rolling sum through cumsum (after replacing NAs with 0s) and insert NAs when conditions are not met:

ff = function(x, w = 3, ntot = 10, nlast = 3)
{
    x[is.na(x)] = 0L
    x = c(0L, x[-length(x)])

    cs = cumsum(x)
    wcs = cs - c(numeric(w), cs[1:(length(x) - w)])

    wcs[!((wcs >= ntot) & (x >= nlast))] = NA
    return(wcs)

}

sapply(df[-1], ff)  # 'df' borrowed from r2evans' answer
#     FirmA FirmB FirmC
#[1,]    NA    NA    NA
#[2,]    20    NA    20
#[3,]    41    NA    NA
#[4,]    63    NA    NA
#[5,]    67    12    NA
#[6,]    56    20    17
#[7,]    43    37    NA
#[8,]    32    52    NA

Upvotes: 1

r2evans
r2evans

Reputation: 161085

Here's a method that uses zoo::rollapply:

df <- structure(list(Date = c("01-2015", "02-2015", "03-2015", "04-2015", 
"05-2015", "06-2015", "07-2015", "08-2015"), FirmA = c(20L, 21L, 
22L, 24L, 10L, 9L, 13L, 20L), FirmB = c(NA, 2L, 3L, 7L, 10L, 
20L, 22L, 19L), FirmC = c(20L, 1L, 2L, 5L, 10L, 2L, 1L, 1L)), .Names = c("Date", 
"FirmA", "FirmB", "FirmC"), class = "data.frame", row.names = c(NA, 
-8L))

library(zoo)

mysum <- function(x, minprev = 3) {
  l <- length(x)
  if (l==1 || (! is.na(x[l-1]) && x[l-1] >= minprev)) sum(x[-l], na.rm = TRUE) else NA
}

winsize <- 3
# conditionally-sum
df[-1] <- lapply(df[-1], function(z) rollapply(z, winsize + 1, mysum, partial = TRUE, align = "right"))
# remove those that are insufficient in total
df[-1] <- lapply(df[-1], function(z) ifelse(z <= 10, NA, z))
df
#      Date FirmA FirmB FirmC
# 1 01-2015    NA    NA    NA
# 2 02-2015    20    NA    20
# 3 03-2015    41    NA    NA
# 4 04-2015    63    NA    NA
# 5 05-2015    67    12    NA
# 6 06-2015    56    20    17
# 7 07-2015    43    37    NA
# 8 08-2015    32    52    NA

There may be a way to not require mysum, but two things make it slightly tricky: (1) the resulting sum goes in the next field (more easily side-stepped if the window is always length 3), and (2) the conditional on the last value. It's certainly feasible to try to smooth it out, but this works well enough.

Upvotes: 2

Related Questions