Reputation: 1293
I have a dataframe of time series data with daily observations of temperatures. I need to create a dummy variable that counts each day that has a temperature above a threshold of 5C. This would be easy in itself, but there is an additional condition: counting starts only after ten consecutive days above the threshold occur. Here's an example dataframe:
df <- data.frame(date = seq(365),
temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
I think I got it done, but with too many loops for my liking. This is what I did:
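df$dummyUnconditional <- 0
df$dummyHead <- 0
df$dummyTail <- 0
# 1 for every day above the threshold
for(i in 1:nrow(df)){
  if(df$temp[i] > 5){
    df$dummyUnconditional[i] <- 1
  }
}
# 1 where a window of ten days above the threshold starts
for(i in 1:(nrow(df)-9)){
  if(sum(df$dummyUnconditional[i:(i+9)]) == 10){
    df$dummyHead[i] <- 1
  }
}
# 1 where a window of ten days above the threshold ends
# (start at 10 so that (i-9):i never includes index 0)
for(i in 10:nrow(df)){
  if(sum(df$dummyUnconditional[(i-9):i]) == 10){
    df$dummyTail[i] <- 1
  }
}
df$dummyConditional <- ifelse(df$dummyHead == 1 | df$dummyTail == 1, 1, 0)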
Could anyone suggest simpler ways for doing this?
Upvotes: 5
Views: 5082
Reputation: 70296
Here's a base R option using rle:
df$dummy <- with(rle(df$temp > 5), rep(as.integer(values & lengths >= 10), lengths))
Some explanation: The task is a classic use case for the run length encoding (rle) function, imo. We first check if the value of temp is greater than 5 (creating a logical vector) and apply rle on that vector, resulting in:
> rle(df$temp > 5)
#Run Length Encoding
# lengths: int [1:7] 66 1 1 225 2 1 69
# values : logi [1:7] FALSE TRUE FALSE TRUE FALSE TRUE ...
Now we want to find those cases where values is TRUE (i.e. temp is greater than 5) and where at the same time lengths is at least 10 (i.e. at least ten consecutive temp values are greater than 5). We do this by running:
values & lengths >= 10
And finally, since we want to return a vector of the same length as nrow(df), we use rep(..., lengths) and as.integer in order to return 1/0 instead of TRUE/FALSE.
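To make the steps above concrete, here is a minimal sketch on a toy logical vector, using a minimum run length of 3 instead of 10 so the result is easy to check by eye. The names x, min_run, r, keep and flag are illustrative only and do not appear in the answer above.

```r
# Toy example: flag runs of TRUE that last at least `min_run` elements.
# `x` and `min_run` are illustrative, not taken from the original data.
x <- c(FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE)
min_run <- 3

r <- rle(x)
# r$lengths: 1 2 1 4 1
# r$values : FALSE TRUE FALSE TRUE FALSE

# One entry per run: TRUE only for runs of TRUE that are long enough
keep <- r$values & r$lengths >= min_run

# Expand back to one entry per element of x
flag <- rep(as.integer(keep), r$lengths)
print(flag)
```

The run of two TRUE values is too short and stays 0, while the run of four is flagged in full, which is the same behaviour the one-liner gives for runs of ten or more days.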
Upvotes: 5
Reputation: 132864
I would do this:
set.seed(42)
df <- data.frame(date = seq(365),
temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
thr <- 5
df$dum <- 0
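#find first 10 consecutive values above threshold
#(stats::filter is spelled out in case dplyr is loaded and masks filter)
test1 <- stats::filter(df$temp > thr, rep(1,10), sides = 1) == 10L
test1[1:9] <- FALSE
#keep only the first match, in case the first qualifying run is exactly 10 days long
n <- which(cumsum(test1) == 1L)[1]
#count days above threshold after that
df$dum[(n+1):nrow(df)] <- cumsum(df$temp[(n+1):nrow(df)] > thr)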
Upvotes: 2
Reputation: 1803
I think you could use a combination of a simple ifelse and the rollapply function in the zoo package to achieve what you are looking for. The final step just involves padding the result to account for the first N-1 days, where there isn't enough information to fill the window.
library(zoo)
df <- data.frame(date = seq(365),
temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
df$above5 <- ifelse(df$temp > 5, 1, 0)
temp <- rollapply(df$above5, 10, sum)
df$conseq <- c(rep(0, 9),temp)
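The padding step is easier to see on a tiny vector. The sketch below swaps in base R's embed() with rowSums() as a stand-in for rollapply(..., sum), so it runs without zoo. It uses a window of 3 rather than 10, and the names above, width, roll and padded are illustrative assumptions, not part of the answer above.

```r
# Toy data: 1 = day above threshold, 0 = not (illustrative values).
above <- c(0, 1, 1, 1, 0, 1)
width <- 3

# embed() puts each window of `width` consecutive values in one row;
# rowSums() then gives the sum of each window.
roll <- rowSums(embed(above, width))
# roll has length(above) - width + 1 elements

# Pad the front so that position i holds the sum over days (i - width + 1):i
padded <- c(rep(0, width - 1), roll)
print(padded)

# A day closes a full run of `width` days above the threshold
# when the windowed sum equals `width`
print(which(padded == width))
```

Note that the padded result is a rolling count of days above the threshold, not a 0/1 dummy, so a comparison such as padded == width is still needed to locate full windows.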
Upvotes: 5