Reputation: 23
My data set has flow rate measurements of a river for every day of the year from 1967 to 2021. This is split up into seasons: Winter (December, Jan, Feb), Spring (March, April, May), Summer (June, July, August) and Autumn (September, October, November).
This is a sample of my data set:
> (south_newton_wylye)
# A tibble: 20,100 x 7
river year season month date flow_rate quality
<chr> <fct> <fct> <fct> <dttm> <dbl> <chr>
1 wylye 1967 Winter January 1967-01-01 00:00:00 6.67 Good
2 wylye 1967 Winter January 1967-01-02 00:00:00 6.39 Good
3 wylye 1967 Winter January 1967-01-03 00:00:00 6.32 Good
4 wylye 1967 Winter January 1967-01-04 00:00:00 6.34 Good
5 wylye 1967 Winter January 1967-01-05 00:00:00 6.37 Good
6 wylye 1967 Winter January 1967-01-06 00:00:00 6.45 Good
7 wylye 1967 Winter January 1967-01-07 00:00:00 6.65 Good
8 wylye 1967 Winter January 1967-01-08 00:00:00 6.54 Good
9 wylye 1967 Winter January 1967-01-09 00:00:00 6.53 Good
10 wylye 1967 Winter January 1967-01-10 00:00:00 6.62 Good
# ... with 20,090 more rows
I would like to find the mean flow rate of each season for each year. I am struggling to write code for the winter season, which runs across two years (e.g. December 1967, January 1968, February 1968).
This was my initial code:
stats.3 <- south_newton_wylye %>% group_by(season, year) %>%
summarise(mean = mean(flow_rate), sd = sd(flow_rate), n = n(),
se = sd/sqrt(n))
stats.3
But for the winter season this groups months of the same calendar year (Jan, Feb, Dec 1967) rather than a winter that starts in December and carries on into January and February of the following year. I would also like a second version that does the same thing but leaves out Autumn and only includes Winter, Spring and Summer. Does anyone know how I can go about this? Thanks :)
Upvotes: 2
Views: 402
Reputation: 818
Your problem is far easier to solve if you use your date variable.
Using dplyr:
library(dplyr) # provides %>% and the verbs used below
dates <- as.Date(c("1966-12-01","1967-01-01","1967-02-01","1967-03-01","1967-04-01","1967-05-01","1967-06-01","1967-07-01","1967-08-01","1967-09-01","1967-10-01","1967-11-01","1967-12-01"))
season <- c("Winter","Winter","Winter","Spring","Spring","Spring","Summer","Summer","Summer","Autumn","Autumn","Autumn","Winter")
var <- c(1,2,3,5,5,5,7,7,7,9,9,9,10)
df <- data.frame(dates, season, var) %>% # creating the data frame
  dplyr::mutate(month = as.numeric(format(dates, "%m")),
                year = as.numeric(format(dates, "%Y")),
                season_id = (12 * year + month) %/% 3) %>% # one identifier per season; December joins the following January and February
  dplyr::group_by(season_id) %>% # grouping by the id
  dplyr::summarise(var = mean(var)) # computing the statistics you need
Note that this solution does not require all three months of a season to be present in the data. Also note that the years have to be consecutive, but that is probably what you meant in your original post.
A bit more explanation :
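The expression `12 * year + month` is a running month count. December of year Y gives `12 * (Y + 1)`, which is a multiple of 3, so integer division by 3 puts it in the same bucket as the following January and February. A minimal base R sketch to check this; the helper name `season_id_of` is introduced here for illustration only and is not part of the answer's code:

```r
# Hypothetical helper wrapping the same formula used in the answer
season_id_of <- function(year, month) (12 * year + month) %/% 3

# December 1966 shares an id with January and February 1967
same_winter <- season_id_of(1966, 12) == season_id_of(1967, 1) &&
  season_id_of(1967, 1) == season_id_of(1967, 2)

# March 1967 starts the next season, so its id is one higher
next_season <- season_id_of(1967, 3) == season_id_of(1967, 2) + 1

print(c(same_winter = same_winter, next_season = next_season))
```

Both checks come out `TRUE`. Because the ids increase by one per season, they also sort chronologically.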
To add readable labels to the season ids:
df <- data.frame(dates, season, var) %>% # creating the data frame
  dplyr::mutate(month = as.numeric(format(dates, "%m")),
                year = as.numeric(format(dates, "%Y")),
                season_id = (12 * year + month) %/% 3) %>% # one identifier per season; December joins the following January and February
  dplyr::group_by(season_id) %>% # grouping by the id
  dplyr::mutate(season_label = paste(min(year), season)) %>% # or max(year), depending on how you define the "winter of a year"
  dplyr::group_by(season_id, season_label) %>% # group by season_label too so the label survives the summarise
  dplyr::summarise(var = mean(var)) # computing the statistics you need
Also keep in mind that you should keep season_id, or you will struggle if you need to sort your data.
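Applied to the column names shown in the question (`date`, `season`, `flow_rate`), the same idea might look like the sketch below. It assumes dplyr 1.0.0 or later for the `.groups` argument. The toy tibble `flows` and its values are made up to stand in for `south_newton_wylye`, and `yr` is derived from `date` because the question's `year` column is a factor. The `filter()` line covers the "no Autumn" variant; remove it to keep all four seasons.

```r
library(dplyr)

# Toy stand-in for south_newton_wylye: invented values, column names from the question
flows <- tibble(
  date      = as.POSIXct(c("1967-12-15", "1968-01-15", "1968-02-15",
                           "1968-04-15", "1968-10-15"), tz = "UTC"),
  season    = c("Winter", "Winter", "Winter", "Spring", "Autumn"),
  flow_rate = c(6, 7, 8, 5, 4)
)

stats <- flows %>%
  filter(season != "Autumn") %>%                  # drop this line to keep Autumn
  mutate(month     = as.numeric(format(date, "%m")),
         yr        = as.numeric(format(date, "%Y")),
         season_id = (12 * yr + month) %/% 3) %>% # December joins the next Jan/Feb
  group_by(season_id, season) %>%
  summarise(mean = mean(flow_rate), sd = sd(flow_rate), n = n(),
            se = sd / sqrt(n), .groups = "drop") %>%
  arrange(season_id)                              # chronological order

print(stats)
```

Here the December 1967 row is averaged together with January and February 1968, which is the winter grouping the question asks for. A season with a single observation gets `NA` for `sd` and `se`.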
Upvotes: 2