wake_wake
wake_wake

Reputation: 1204

r - Subsetting time-series data.frame based on time and threshold

How would you subset a time-series data.frame based on time and a threshold value?

I have this data:

year <- seq(2000, 2009, 1)
v1 <- sample(1:10, 10, replace=T)
df <- data.frame(year, v1)

That looks like this:

> df
  year v1
1  2000  9
2  2001  4
3  2002  5
4  2003  4
5  2004  5
6  2005  3
7  2006  3
8  2007  3
9  2008  9
10 2009  6

I want to subset the data by groups of sequential years for which the summed score on v1 exceeds the value of 10.

On this example data the first subset should hold observations of the year 2000 & 2001. The second subset should hold the observations of year 2002, 2003 and 2004.

The real data has about 8 million observations covering 120 years.

Upvotes: 1

Views: 561

Answers (1)

akuiper
akuiper

Reputation: 215127

You can implement a customized cumsum using Reduce function, reset the sum when the total exceeds 10 and at same time increment a count as group variable:

library(data.table)
transpose(Reduce(function(x, y) if(x[1] > 10) c(y, x[2]+1) else c(x[1] + y, x[2]), 
                 init = c(0, 1), df$v1, accumulate = T))[[2]][-1]

# here the init parameter will take two parameters, the first one keep track of the cumsum,
# and the second one serves as a group variable, when the sum exceeds 10, reset the sum to 
# zero and increase the group variable by one

# [1] 1 1 2 2 2 3 3 3 3 4

It takes around 20 seconds to run over 10 million observations vector:

v = sample(1:10, 10000000, replace = T)
system.time(transpose(Reduce(function(x, y) if(x[1] > 10) c(y, x[2]+1) else c(x[1] + y, x[2]), init = c(0, 1), v, accumulate = T))[[2]])

#   user  system elapsed 
# 19.509   0.552  20.081 

Upvotes: 3

Related Questions