tassones
tassones

Reputation: 1692

In R, if there are multiple TRUE answers, select the first TRUE answer

I have a dataframe that represents a two-year daily time series of temperature for one river. For this river, I would like to know what day of year (doy):

  1. temperature is sustained greater than or equal to 10 degrees
  1. temperature is sustained less than or equal to 10 degrees

I run into errors when I try to calculate 2 because there are multiple TRUE answers for the code to choose from. I would like to know how I can make the code go with the first TRUE answer if there are multiple TRUE answers.

Example Dataset

library(ggplot2)
library(lubridate)
library(dplyr)
library(dataRetrieval)

siteNumber <- "01417500"
parameterCd <- "00010" # water temperature
statCd <- "00003" # mean
startDate <- "2015-01-01"
endDate <- "2016-12-31"

dat <- readNWISdv(siteNumber, parameterCd, startDate, endDate, statCd=statCd)
dat <- dat[,c(2:4)]
colnames(dat)[3] <- "temperature"

# Visually inspect the time series
ggplot(data = dat, aes(x = Date, y = temperature)) +
  geom_point() +
  theme_bw()

Code for 1 & 2 where 2 is having issues because there are multiple TRUE statements to choose from

dat %>%
  mutate(year = year(Date),
         doy = yday(Date)) %>%
  group_by(year) %>%
  mutate(gt_10 = temperature >= 10, # greater than or equal to 10 degrees
         lt_10 = temperature <= 10, # less than or equal to 10 degrees
         peak_doy = doy[which.max(temperature)], # what doy is max temperature
         below_peak = doy < peak_doy, # is the observed doy less than the peak temperature doy
         after_peak = doy > peak_doy, # is the observed doy greater than the peak temperature doy
         test_above = ave(gt_10, cumsum(!gt_10), FUN = cumsum), # counts number of days above 10 degree threshold
         test_below = ave(lt_10, cumsum(!lt_10), FUN = cumsum)) %>% # counts number of days below 10 degree threshold
  summarise(first_above_10_sustained = doy[below_peak == T & test_above == 14]-13, # answer to 1 
            first_below_10_sustained = doy[after_peak == T & test_below == 14]-13) # answer to 2

I have a similar SO question here but my answer only works when there is not multiple TRUE answers to choose from.

Upvotes: 2

Views: 65

Answers (2)

tassones
tassones

Reputation: 1692

Here is an answer that builds off of the original approach to the question and uses parts proposed by @AllanCameron

This works by splitting the annual time series into times before and after the peak temperature. Then a run-length ID column (run) is added. This allows us to determine where the run is max for the subset of run when the values are TRUE in below_peak or after_peak. We can then extract the first element of doy.

dat %>%
  mutate(year = year(Date),
         doy = yday(Date)) %>%
  group_by(site_no, year) %>%
  mutate(gt_10 = temperature >= 10,
         lt_10 = temperature <= 10,
         peak_doy = doy[which.max(temperature)],
         below_peak = doy < peak_doy,
         after_peak = doy > peak_doy,
         run = data.table::rleid(lt_10)) %>%
  summarise(sustain_above = first(doy[run == max(run[below_peak])]),
            sustain_below = first(doy[run == max(run[after_peak])]), .groups = 'drop')

Upvotes: 1

Allan Cameron
Allan Cameron

Reputation: 173793

There are a couple of tricks I would employ here:

  1. Create a column that specifies whether the temperature is above or below 10 degrees.
  2. Get the rleid of this column, which will group all of the consecutive days above or below the threshold of 10 degrees.
  3. Find the maximum temperature of each year, and store it in a column.
  4. The rleid which contains that maximum temperature will be the dates where temperature is sustained > 10 degrees for that year according to your definition
  5. Do the same thing with minimum temperatures, but subtract 6 months when calculating the year to group by when calculating the minimum for that year. This will not make a difference to the end result but allows a calculation of the minimum temperature in the winter period:
df <- dat %>%
  mutate(year = year(Date)) %>%
  group_by(year) %>%
  mutate(max_temp = max(temperature)) %>%
  ungroup() %>%
  mutate(above_ten = temperature >= 10,
         run = factor(data.table::rleid(above_ten))) %>%
  group_by(run) %>%
  mutate(sustained_hi = max(temperature) == max(max_temp)) %>%
  ungroup() %>%
  mutate(year = year(Date - months(6))) %>%
  group_by(year) %>%
  mutate(min_temp = min(temperature)) %>%
  group_by(run) %>%
  mutate(sustained_lo = min(temperature) == min(min_temp)) %>%
  mutate(group = ifelse(sustained_hi, 'High', 
                        ifelse(sustained_lo, 'Low', 
                               'Unsustained'))) %>%
  select(site_no, Date, temperature, group, run)

This results in:

df
#> # A tibble: 731 x 5
#> # Groups:   run [27]
#>    site_no  Date       temperature group run  
#>    <chr>    <date>           <dbl> <chr> <fct>
#>  1 01417500 2015-01-01         0.7 Low   1    
#>  2 01417500 2015-01-02         1.1 Low   1    
#>  3 01417500 2015-01-03         1   Low   1    
#>  4 01417500 2015-01-04         2.5 Low   1    
#>  5 01417500 2015-01-05         2   Low   1    
#>  6 01417500 2015-01-06         0.3 Low   1    
#>  7 01417500 2015-01-07         0.2 Low   1    
#>  8 01417500 2015-01-08         0.2 Low   1    
#>  9 01417500 2015-01-09         0.3 Low   1    
#> 10 01417500 2015-01-10         0.3 Low   1    
#> # ... with 721 more rows
#> # i Use `print(n = ...)` to see more rows

And we can see the result by plotting like this:

ggplot(df, aes(x = Date, y = temperature, color = group)) +
  geom_point() +
  scale_color_manual(limits = c('High', 'Unsustained', 'Low'),
                     values = c('orange', 'gray', 'steelblue')) +
  theme_bw()

enter image description here

And we can get a nice little data frame of the start and end dates of our sustained high / low temperatures by doing:

df %>% 
  filter(group != 'Unsustained') %>%
  group_by(run) %>%
  summarize(Date = c(first(Date), last(Date)),
            Event = paste('Sustained', first(group), c('Start', 'End'))) %>%
  ungroup() %>%
  select(-run)
#> # A tibble: 10 x 2
#>    Date       Event               
#>    <date>     <chr>               
#>  1 2015-01-01 Sustained Low Start 
#>  2 2015-04-28 Sustained Low End   
#>  3 2015-04-29 Sustained High Start
#>  4 2015-07-16 Sustained High End  
#>  5 2015-11-08 Sustained Low Start 
#>  6 2016-03-31 Sustained Low End   
#>  7 2016-05-18 Sustained High Start
#>  8 2016-10-09 Sustained High End  
#>  9 2016-10-23 Sustained Low Start 
#> 10 2016-12-31 Sustained Low End 

Upvotes: 7

Related Questions