Reputation: 1692
I have a dataframe that represents a two-year daily time series of temperature for one river. For this river, I would like to know what day of year (doy
):
I run into errors when I try to calculate 2 because there are multiple TRUE
answers for the code to choose from. I would like to know how I can make the code go with the first TRUE
answer if there are multiple TRUE
answers.
Example Dataset
library(ggplot2)
library(lubridate)
library(dplyr)
library(dataRetrieval)
siteNumber <- "01417500"
parameterCd <- "00010" # water temperature
statCd <- "00003" # mean
startDate <- "2015-01-01"
endDate <- "2016-12-31"
dat <- readNWISdv(siteNumber, parameterCd, startDate, endDate, statCd=statCd)
dat <- dat[,c(2:4)]
colnames(dat)[3] <- "temperature"
# Visually inspect the time series
ggplot(data = dat, aes(x = Date, y = temperature)) +
geom_point() +
theme_bw()
Code for 1 & 2 where 2 is having issues because there are multiple TRUE
statements to choose from
dat %>%
mutate(year = year(Date),
doy = yday(Date)) %>%
group_by(year) %>%
mutate(gt_10 = temperature >= 10, # greater than or equal to 10 degrees
lt_10 = temperature <= 10, # less than or equal to 10 degrees
peak_doy = doy[which.max(temperature)], # what doy is max temperature
below_peak = doy < peak_doy, # is the observed doy less than the peak temperature doy
after_peak = doy > peak_doy, # is the observed doy greater than the peak temperature doy
test_above = ave(gt_10, cumsum(!gt_10), FUN = cumsum), # counts number of days above 10 degree threshold
test_below = ave(lt_10, cumsum(!lt_10), FUN = cumsum)) %>% # counts number of days below 10 degree threshold
summarise(first_above_10_sustained = doy[below_peak == T & test_above == 14]-13, # answer to 1
first_below_10_sustained = doy[after_peak == T & test_below == 14]-13) # answer to 2
after_peak == T
) and when temperature has been below the 10 threshold for 14 consecutive days (i.e., test_below == 14
). The test_below == 14
is where the error lies because there are multiple times when this occurs. Yes, you could change the threshold of consecutive days to some value > 14 but that is beside the point. How can I get the code to accept the first TRUE
answer is there are multiple TRUE
answers?I have a similar SO question here but my answer only works when there is not multiple TRUE
answers to choose from.
Upvotes: 2
Views: 65
Reputation: 1692
Here is an answer that builds off of the original approach to the question and uses parts proposed by @AllanCameron
This works by splitting the annual time series into times before and after the peak temperature. Then a run-length ID column (run
) is added. This allows us to determine where the run
is max for the subset of run
when the values are TRUE
in below_peak
or after_peak
. We can then extract the first element of doy
.
dat %>%
mutate(year = year(Date),
doy = yday(Date)) %>%
group_by(site_no, year) %>%
mutate(gt_10 = temperature >= 10,
lt_10 = temperature <= 10,
peak_doy = doy[which.max(temperature)],
below_peak = doy < peak_doy,
after_peak = doy > peak_doy,
run = data.table::rleid(lt_10)) %>%
summarise(sustain_above = first(doy[run == max(run[below_peak])]),
sustain_below = first(doy[run == max(run[after_peak])]), .groups = 'drop')
Upvotes: 1
Reputation: 173793
There are a couple of tricks I would employ here:
rleid
of this column, which will group all of the consecutive days above or below the threshold of 10 degrees.rleid
which contains that maximum temperature will be the dates where temperature is sustained > 10 degrees for that year according to your definitiondf <- dat %>%
mutate(year = year(Date)) %>%
group_by(year) %>%
mutate(max_temp = max(temperature)) %>%
ungroup() %>%
mutate(above_ten = temperature >= 10,
run = factor(data.table::rleid(above_ten))) %>%
group_by(run) %>%
mutate(sustained_hi = max(temperature) == max(max_temp)) %>%
ungroup() %>%
mutate(year = year(Date - months(6))) %>%
group_by(year) %>%
mutate(min_temp = min(temperature)) %>%
group_by(run) %>%
mutate(sustained_lo = min(temperature) == min(min_temp)) %>%
mutate(group = ifelse(sustained_hi, 'High',
ifelse(sustained_lo, 'Low',
'Unsustained'))) %>%
select(site_no, Date, temperature, group, run)
This results in:
df
#> # A tibble: 731 x 5
#> # Groups: run [27]
#> site_no Date temperature group run
#> <chr> <date> <dbl> <chr> <fct>
#> 1 01417500 2015-01-01 0.7 Low 1
#> 2 01417500 2015-01-02 1.1 Low 1
#> 3 01417500 2015-01-03 1 Low 1
#> 4 01417500 2015-01-04 2.5 Low 1
#> 5 01417500 2015-01-05 2 Low 1
#> 6 01417500 2015-01-06 0.3 Low 1
#> 7 01417500 2015-01-07 0.2 Low 1
#> 8 01417500 2015-01-08 0.2 Low 1
#> 9 01417500 2015-01-09 0.3 Low 1
#> 10 01417500 2015-01-10 0.3 Low 1
#> # ... with 721 more rows
#> # i Use `print(n = ...)` to see more rows
And we can see the result by plotting like this:
ggplot(df, aes(x = Date, y = temperature, color = group)) +
geom_point() +
scale_color_manual(limits = c('High', 'Unsustained', 'Low'),
values = c('orange', 'gray', 'steelblue')) +
theme_bw()
And we can get a nice little data frame of the start and end dates of our sustained high / low temperatures by doing:
df %>%
filter(group != 'Unsustained') %>%
group_by(run) %>%
summarize(Date = c(first(Date), last(Date)),
Event = paste('Sustained', first(group), c('Start', 'End'))) %>%
ungroup() %>%
select(-run)
#> # A tibble: 10 x 2
#> Date Event
#> <date> <chr>
#> 1 2015-01-01 Sustained Low Start
#> 2 2015-04-28 Sustained Low End
#> 3 2015-04-29 Sustained High Start
#> 4 2015-07-16 Sustained High End
#> 5 2015-11-08 Sustained Low Start
#> 6 2016-03-31 Sustained Low End
#> 7 2016-05-18 Sustained High Start
#> 8 2016-10-09 Sustained High End
#> 9 2016-10-23 Sustained Low Start
#> 10 2016-12-31 Sustained Low End
Upvotes: 7