Reputation: 181
I am working on a temperature data set and I am looking to highlight days of consecutive max temperatures of over 30C and create a new TRUE/FALSE column in a data frame that highlights a heat wave. To do this I need to look at the previous 3 days data and verify whether they all meet this condition, as well as the current day itself.
I have tried to use the lag command to generate multiple new columns for -1 days, -2 days and -3 days and then use an ifelse statement to verify all columns are true. But this doesn't seem like an efficient way of doing things and I can't seem to get the lag variable to look in the past. Since multiple stations record daily data I need to also look at each weather station individually to verify a heat wave on a given day.
station_id = c(rep(1,10), rep(2,10))
date = c("20190129", "20190130", "20190131", "20190201", "20190202", "20190203", "20190204", "20190205", "20190206", "20190207")
max_temp = c(27, 35, 36, 38, 31, 32, 33, 29, 28, 26, 21, 20, 13, 31, 31, 32, 33, 35, 25, 20)
data = data.frame(station_id, date, max_temp)
print(data)
lag(data$max_temp, 3) #gives the forward lagging but i cant seem to be able to look backwards using negative numbers, perhaps I need to not include the first few columns?
Upvotes: 2
Views: 258
Reputation: 368
I would use a tidyverse
approach. Rather than using stats::lag()
, you can use dplyr::lag()
, which is a bit more straightforward for what you're trying to accomplish:
library(tidyverse)
dfr <- dfr %>%
group_by(station_id) %>%
mutate(Heat_Wave = case_when(
max_temp >= 30 & lag(max_temp, 1) >= 30 & lag(max_temp, 2) >= 30 ~ TRUE, TRUE ~ FALSE
)) %>%
ungroup()
print(dfr)
station_id date max_temp Heat_Wave
1 1 20190129 27 FALSE
2 1 20190130 35 FALSE
3 1 20190131 36 FALSE
4 1 20190201 38 TRUE
5 1 20190202 31 TRUE
6 1 20190203 32 TRUE
7 1 20190204 33 TRUE
8 1 20190205 29 FALSE
9 1 20190206 28 FALSE
10 1 20190207 26 FALSE
11 2 20190129 21 FALSE
12 2 20190130 20 FALSE
13 2 20190131 13 FALSE
14 2 20190201 31 FALSE
15 2 20190202 31 FALSE
16 2 20190203 32 TRUE
17 2 20190204 33 TRUE
18 2 20190205 35 TRUE
19 2 20190206 25 FALSE
20 2 20190207 20 FALSE
Upvotes: 1
Reputation: 23
I have one small correction to jon's answer. You should use group_by(data,station_id)
from tidyverse to run the function individually for each station ID - otherwise, results will spill between the stations. It doesn't matter in your example, but if your data had something like:
station_id date max_temp
1 20190204 15
1 20190205 30
1 20190206 30
1 20190207 30
2 20190129 30
It would be a problem. You wouldn't want the last row to return TRUE. So the new function is:
data<-group_by(data,station_id)%>%
mutate(heatwave=case_when(
max_temp>=30 & lag(max_temp,1)>=30 & lag(max_temp,2)>=30 ~ TRUE, TRUE ~ FALSE))
Don't forget to ungroup()
afterwards to avoid unexpected things happening!
Upvotes: 0