Reputation: 69
I am quite new to R, so apologies in advance if I state something wrong :)
I have a dataframe consisting of 395 rows and 4973 columns, sorted by months, with number of occurrence per month (ranges from 0 to eg 25) for a lot of companies. The number of occurrence was summarised from daily data grouped by month and year. My dataframe df looks something like that (only a few months and 3 companies):
Date FirmA FirmB FirmC
01-2015 20 NA 20
02-2015 21 2 1
03-2015 22 3 2
04-2015 24 7 5
05-2015 10 10 10
06-2015 9 20 2
07-2015 13 22 1
08-2015 20 19 1
I have now the task to sum up the occurences per company by a three month rolling window from months t-3 to t-1 (the 3 previous months). However, the sum should have following conditions. It should have at least 10 occurrences during the three month window and at least 3 occurrences in month t-1. It doesn't matter if an NA is in t-3 and/or t-2, as long as the two conditions are met.
It should look like that.
Date FirmA FirmB FirmC
01-2015 NA NA NA
02-2015 20 NA 20
03-2015 41 NA NA
04-2015 63 NA NA
05-2015 67 12 NA
06-2015 56 20 17
07-2015 43 37 NA
08-2015 32 52 NA
I have no clue, how to approach that, especially the combination of rolling window/sum (probably something with lag) and the conditions regarding which numbers to use and which not.
Upvotes: 2
Views: 785
Reputation: 13132
Another approach, similar in concept to r2evans', is to compute the rolling sum through cumsum
(after replacing NA
s with 0
s) and insert NA
s when conditions are not met:
ff = function(x, w = 3, ntot = 10, nlast = 3)
{
x[is.na(x)] = 0L
x = c(0L, x[-length(x)])
cs = cumsum(x)
wcs = cs - c(numeric(w), cs[1:(length(x) - w)])
wcs[!((wcs >= ntot) & (x >= nlast))] = NA
return(wcs)
}
sapply(df[-1], ff) # 'df' borrowed from r2evans' answer
# FirmA FirmB FirmC
#[1,] NA NA NA
#[2,] 20 NA 20
#[3,] 41 NA NA
#[4,] 63 NA NA
#[5,] 67 12 NA
#[6,] 56 20 17
#[7,] 43 37 NA
#[8,] 32 52 NA
Upvotes: 1
Reputation: 161085
Here's a method that uses zoo::rollapply
:
df <- structure(list(Date = c("01-2015", "02-2015", "03-2015", "04-2015",
"05-2015", "06-2015", "07-2015", "08-2015"), FirmA = c(20L, 21L,
22L, 24L, 10L, 9L, 13L, 20L), FirmB = c(NA, 2L, 3L, 7L, 10L,
20L, 22L, 19L), FirmC = c(20L, 1L, 2L, 5L, 10L, 2L, 1L, 1L)), .Names = c("Date",
"FirmA", "FirmB", "FirmC"), class = "data.frame", row.names = c(NA,
-8L))
library(zoo)
mysum <- function(x, minprev = 3) {
l <- length(x)
if (l==1 || (! is.na(x[l-1]) && x[l-1] >= minprev)) sum(x[-l], na.rm = TRUE) else NA
}
winsize <- 3
# conditionally-sum
df[-1] <- lapply(df[-1], function(z) rollapply(z, winsize + 1, mysum, partial = TRUE, align = "right"))
# remove those that are insufficient in total
df[-1] <- lapply(df[-1], function(z) ifelse(z <= 10, NA, z))
df
# Date FirmA FirmB FirmC
# 1 01-2015 NA NA NA
# 2 02-2015 20 NA 20
# 3 03-2015 41 NA NA
# 4 04-2015 63 NA NA
# 5 05-2015 67 12 NA
# 6 06-2015 56 20 17
# 7 07-2015 43 37 NA
# 8 08-2015 32 52 NA
There may be a way to not require mysum
, but two things make it slightly tricky: (1) the resulting sum goes in the next field (more easily side-stepped if the window is always length 3), and (2) the conditional on the last value. It's certainly feasible to try to smooth it out, but this works well enough.
Upvotes: 2