Aesler
Aesler

Reputation: 181

Create new data frame column conditional on previous days data (lagged variable)

I am working on a temperature data set and I am looking to highlight days of consecutive max temperatures of over 30C and create a new TRUE/FALSE column in a data frame that highlights a heat wave. To do this I need to look at the previous 3 days data and verify whether they all meet this condition, as well as the current day itself.

I have tried to use the lag command to generate multiple new columns for -1 days, -2 days and -3 days and then use an ifelse statement to verify all columns are true. But this doesn't seem like an efficient way of doing things and I can't seem to get the lag variable to look in the past. Since multiple stations record daily data I need to also look at each weather station individually to verify a heat wave on a given day.

station_id = c(rep(1,10), rep(2,10))
date = c("20190129", "20190130", "20190131", "20190201", "20190202", "20190203", "20190204", "20190205", "20190206", "20190207")
max_temp = c(27, 35, 36, 38, 31, 32, 33, 29, 28, 26, 21, 20, 13, 31, 31, 32, 33, 35, 25, 20)
data = data.frame(station_id, date, max_temp)
print(data)

lag(data$max_temp, 3) #gives the forward lagging but i cant seem to be able to look backwards using negative numbers, perhaps I need to not include the first few columns?

Upvotes: 2

Views: 258

Answers (2)

jon
jon

Reputation: 368

I would use a tidyverse approach. Rather than using stats::lag(), you can use dplyr::lag(), which is a bit more straightforward for what you're trying to accomplish:

library(tidyverse)
dfr <- dfr %>%
  group_by(station_id) %>%
  mutate(Heat_Wave = case_when(
    max_temp >= 30 & lag(max_temp, 1) >= 30 & lag(max_temp, 2) >= 30 ~ TRUE, TRUE ~ FALSE
  )) %>%
  ungroup()


print(dfr)
   station_id     date max_temp Heat_Wave
1           1 20190129       27     FALSE
2           1 20190130       35     FALSE
3           1 20190131       36     FALSE
4           1 20190201       38      TRUE
5           1 20190202       31      TRUE
6           1 20190203       32      TRUE
7           1 20190204       33      TRUE
8           1 20190205       29     FALSE
9           1 20190206       28     FALSE
10          1 20190207       26     FALSE
11          2 20190129       21     FALSE
12          2 20190130       20     FALSE
13          2 20190131       13     FALSE
14          2 20190201       31     FALSE
15          2 20190202       31     FALSE
16          2 20190203       32      TRUE
17          2 20190204       33      TRUE
18          2 20190205       35      TRUE
19          2 20190206       25     FALSE
20          2 20190207       20     FALSE

Upvotes: 1

eable
eable

Reputation: 23

I have one small correction to jon's answer. You should use group_by(data,station_id) from tidyverse to run the function individually for each station ID - otherwise, results will spill between the stations. It doesn't matter in your example, but if your data had something like:

   station_id     date max_temp
            1 20190204       15
            1 20190205       30
            1 20190206       30
            1 20190207       30
            2 20190129       30

It would be a problem. You wouldn't want the last row to return TRUE. So the new function is:

data<-group_by(data,station_id)%>%
    mutate(heatwave=case_when(
max_temp>=30 & lag(max_temp,1)>=30 & lag(max_temp,2)>=30 ~ TRUE, TRUE ~ FALSE))

Don't forget to ungroup() afterwards to avoid unexpected things happening!

Upvotes: 0

Related Questions