Reputation: 2763
I have a data table with daily precipitation measurements that looks like this:
library(data.table)
pr <- data.table(date=seq(as.Date("2000-01-01"), as.Date("2004-12-31"), by="1 day"),
precip=runif(1827, 0, 20))
date precip
1: 2000-01-01 8.390553
2: 2000-01-02 11.753791
3: 2000-01-03 1.346553
4: 2000-01-04 15.130108
5: 2000-01-05 2.027514
---
1823: 2004-12-27 17.550784
1824: 2004-12-28 7.054898
1825: 2004-12-29 4.458190
1826: 2004-12-30 6.989788
1827: 2004-12-31 2.257400
I would like to get the cumulative precipitation for all growing seasons in this dataset, where growing seasons are defined as the time period between September through April.
Therefore, the resulting data table should contain the cumulative sums for Sep2000 - Apr2001, Set2001 - Apr 2002 and so on.
How can this be achieved?
Upvotes: 0
Views: 319
Reputation: 8844
Here is a data.table
approach. We first identify growing (TRUE) / harvest (FALSE) seasons, then give each season a run-length id, and finally summarise the start and end dates together with the total precipitation by season count.
set.seed(1)
pr <- data.table(
date=seq(as.Date("2000-01-01"), as.Date("2004-12-31"), by="1 day"),
precip=runif(1827, 0, 20)
)
pr[
, gr_season := !(month(date) %in% 5:8)
][
, season_count := rleidv(gr_season)
][
gr_season == TRUE, .(period = paste0(head(date, 1L), "/", tail(date, 1L)), precip = sum(precip)),
by = season_count
]
Output
season_count period precip
1: 1 2000-01-01/2000-04-30 1251.741
2: 3 2000-09-01/2001-04-30 2352.559
3: 5 2001-09-01/2002-04-30 2466.817
4: 7 2002-09-01/2003-04-30 2326.178
5: 9 2003-09-01/2004-04-30 2418.478
6: 11 2004-09-01/2004-12-31 1136.972
Daily cumsum for growing seasons
pr[
, gr_season := !(month(date) %in% 5:8)
][
, season_count := rleidv(gr_season)
][
gr_season == TRUE, .(date = date, precip = cumsum(precip)),
by = season_count
]
Output
season_count date precip
1: 1 2000-01-01 5.310173
2: 1 2000-01-02 12.752651
3: 1 2000-01-03 24.209719
4: 1 2000-01-04 42.373874
5: 1 2000-01-05 46.407513
---
1208: 11 2004-12-27 1101.280221
1209: 11 2004-12-28 1112.926760
1210: 11 2004-12-29 1114.345408
1211: 11 2004-12-30 1125.758288
1212: 11 2004-12-31 1136.971833
Upvotes: 4
Reputation: 173803
The end of April is 120 days from the start of the year. If you subtract 120 days from all entries in your date
column to create a pseudodate
column, then the pseudodate
of every day of your growing period will be shifted into the previous year. Since subtracting 120 days from 1st September gives 4th May of the same year, any pseudodate
before 4th May will be in a non-growing period, and anything on or after that date until the end of the year must be in the growing period that started in the September of that year. By this method, we can easily label each actual date according to whether it is in the growing period, and according to the year that growing period started.
Then all we need to do is filter out the dates that weren't in the growing period, group_by
the year the growing season began and perform a cumsum
on the precip
column:
library(dplyr)
library(lubridate)
pr %>% mutate(pseudodate = date - days(120),
is_growing = yday(pseudodate) > 125,
season_beginning = year(pseudodate)) %>%
filter(is_growing) %>%
mutate(cum_precip = cumsum(precip)) %>%
select(date, precip, season_beginning, cum_precip)
#> date precip season_beginning cum_precip
#> 1: 2000-01-01 17.694152 1999 17.69415
#> 2: 2000-01-02 6.066319 1999 23.76047
#> 3: 2000-01-03 4.793192 1999 28.55366
#> 4: 2000-01-04 15.753112 1999 44.30678
#> 5: 2000-01-05 12.253172 1999 56.55995
#> ---
#> 1198: 2004-12-27 8.983804 2004 11490.72677
#> 1199: 2004-12-28 6.740315 2004 11497.46709
#> 1200: 2004-12-29 3.899960 2004 11501.36705
#> 1201: 2004-12-30 6.357432 2004 11507.72448
#> 1202: 2004-12-31 3.950666 2004 11511.67515
Of course, since this example data set starts on January 1st, we have already missed a few months of the growing period for that season, so the first year would probably need removed from your actual data if it also starts in January.
Upvotes: 2