Reputation: 1204
How would you subset a time-series data.frame based on time and a threshold value?
I have this data:
year <- seq(2000, 2009, 1)
v1 <- sample(1:10, 10, replace=T)
df <- data.frame(year, v1)
That looks like this:
> df
year v1
1 2000 9
2 2001 4
3 2002 5
4 2003 4
5 2004 5
6 2005 3
7 2006 3
8 2007 3
9 2008 9
10 2009 6
I want to subset the data by groups of sequential years for which the summed score on v1
exceeds the value of 10.
On this example data the first subset should hold observations of the year 2000 & 2001. The second subset should hold the observations of year 2002, 2003 and 2004.
The real data has about 8 million observations covering 120 years.
Upvotes: 1
Views: 561
Reputation: 215127
You can implement a customized cumsum
using Reduce
function, reset the sum when the total exceeds 10 and at same time increment a count as group variable:
library(data.table)
transpose(Reduce(function(x, y) if(x[1] > 10) c(y, x[2]+1) else c(x[1] + y, x[2]),
init = c(0, 1), df$v1, accumulate = T))[[2]][-1]
# here the init parameter will take two parameters, the first one keep track of the cumsum,
# and the second one serves as a group variable, when the sum exceeds 10, reset the sum to
# zero and increase the group variable by one
# [1] 1 1 2 2 2 3 3 3 3 4
It takes around 20 seconds to run over 10 million observations vector:
v = sample(1:10, 10000000, replace = T)
system.time(transpose(Reduce(function(x, y) if(x[1] > 10) c(y, x[2]+1) else c(x[1] + y, x[2]), init = c(0, 1), v, accumulate = T))[[2]])
# user system elapsed
# 19.509 0.552 20.081
Upvotes: 3