thiagoveloso
thiagoveloso

Reputation: 2763

R - Cumulative sum that crosses years

I have a data table with daily precipitation measurements that looks like this:

library(data.table)
pr <- data.table(date=seq(as.Date("2000-01-01"), as.Date("2004-12-31"), by="1 day"),
                 precip=runif(1827, 0, 20))
            date    precip
   1: 2000-01-01  8.390553
   2: 2000-01-02 11.753791
   3: 2000-01-03  1.346553
   4: 2000-01-04 15.130108
   5: 2000-01-05  2.027514
  ---                     
1823: 2004-12-27 17.550784
1824: 2004-12-28  7.054898
1825: 2004-12-29  4.458190
1826: 2004-12-30  6.989788
1827: 2004-12-31  2.257400

I would like to get the cumulative precipitation for all growing seasons in this dataset, where growing seasons are defined as the time period between September through April.

Therefore, the resulting data table should contain the cumulative sums for Sep2000 - Apr2001, Set2001 - Apr 2002 and so on.

How can this be achieved?

Upvotes: 0

Views: 319

Answers (2)

ekoam
ekoam

Reputation: 8844

Here is a data.table approach. We first identify growing (TRUE) / harvest (FALSE) seasons, then give each season a run-length id, and finally summarise the start and end dates together with the total precipitation by season count.

set.seed(1)
pr <- data.table(
  date=seq(as.Date("2000-01-01"), as.Date("2004-12-31"), by="1 day"),
  precip=runif(1827, 0, 20)
)

pr[
  , gr_season := !(month(date) %in% 5:8)
][
  , season_count := rleidv(gr_season) 
][
  gr_season == TRUE, .(period = paste0(head(date, 1L), "/", tail(date, 1L)), precip = sum(precip)), 
  by = season_count
]

Output

   season_count                period   precip
1:            1 2000-01-01/2000-04-30 1251.741
2:            3 2000-09-01/2001-04-30 2352.559
3:            5 2001-09-01/2002-04-30 2466.817
4:            7 2002-09-01/2003-04-30 2326.178
5:            9 2003-09-01/2004-04-30 2418.478
6:           11 2004-09-01/2004-12-31 1136.972

Daily cumsum for growing seasons

pr[
  , gr_season := !(month(date) %in% 5:8)
][
  , season_count := rleidv(gr_season)
][
  gr_season == TRUE, .(date = date, precip = cumsum(precip)), 
  by = season_count
]

Output

      season_count       date      precip
   1:            1 2000-01-01    5.310173
   2:            1 2000-01-02   12.752651
   3:            1 2000-01-03   24.209719
   4:            1 2000-01-04   42.373874
   5:            1 2000-01-05   46.407513
  ---                                    
1208:           11 2004-12-27 1101.280221
1209:           11 2004-12-28 1112.926760
1210:           11 2004-12-29 1114.345408
1211:           11 2004-12-30 1125.758288
1212:           11 2004-12-31 1136.971833

Upvotes: 4

Allan Cameron
Allan Cameron

Reputation: 173803

The end of April is 120 days from the start of the year. If you subtract 120 days from all entries in your date column to create a pseudodate column, then the pseudodate of every day of your growing period will be shifted into the previous year. Since subtracting 120 days from 1st September gives 4th May of the same year, any pseudodate before 4th May will be in a non-growing period, and anything on or after that date until the end of the year must be in the growing period that started in the September of that year. By this method, we can easily label each actual date according to whether it is in the growing period, and according to the year that growing period started.

Then all we need to do is filter out the dates that weren't in the growing period, group_by the year the growing season began and perform a cumsum on the precip column:

library(dplyr)
library(lubridate)

pr %>% mutate(pseudodate = date - days(120),
              is_growing = yday(pseudodate) > 125,
              season_beginning = year(pseudodate)) %>%
  filter(is_growing) %>%
  mutate(cum_precip = cumsum(precip)) %>%
  select(date, precip, season_beginning, cum_precip)

#>             date    precip season_beginning  cum_precip
#>    1: 2000-01-01 17.694152             1999    17.69415
#>    2: 2000-01-02  6.066319             1999    23.76047
#>    3: 2000-01-03  4.793192             1999    28.55366
#>    4: 2000-01-04 15.753112             1999    44.30678
#>    5: 2000-01-05 12.253172             1999    56.55995
#>   ---                                                  
#> 1198: 2004-12-27  8.983804             2004 11490.72677
#> 1199: 2004-12-28  6.740315             2004 11497.46709
#> 1200: 2004-12-29  3.899960             2004 11501.36705
#> 1201: 2004-12-30  6.357432             2004 11507.72448
#> 1202: 2004-12-31  3.950666             2004 11511.67515

Of course, since this example data set starts on January 1st, we have already missed a few months of the growing period for that season, so the first year would probably need removed from your actual data if it also starts in January.

Upvotes: 2

Related Questions