categorise df based on date ranges, with multiple factor levels

Question

How might I categorize each row in an R dataframe (>10 000 rows) based on date range definitions in a separate, much smaller R dataframe (62 rows)?

My large dataframe, ConcFlow, looks similar to this when called via head(ConcFlow) :

    STATION Sampling.Year   DATE    Flow
1   2016-2017   13/03/2017  177.45
1   2016-2017   12/01/2017  96.798
1   2016-2017   11/01/2017  99.902
2   2016-2017   4/03/2017   109.74
2   2016-2017   5/03/2017   100.55
3   2016-2017   19/05/2017  2302.5
1   2017-2018   13/03/2018  177.45
1   2017-2018   12/01/2018  96.798
1   2017-2018   11/01/2018  99.902
2   2017-2018   4/03/2018   109.74
2   2017-2018   5/03/2018   100.55
3   2017-2018   19/05/2018  2302.5

The smaller dataset, First.Flush, contains dates for the start and end of the Australian wet season, like this:

  STATION   Sampling.Year   Start   End Season
1   2011-2012   1/01/2012   1/07/2012   Wet Season
1   2013-2014   1/01/2014   2/07/2014   Wet Season
1   2014-2015   1/01/2015   2/07/2015   Wet Season
1   2015-2016   23/12/2015  22/06/2016  Wet Season
1   2016-2017   12/12/2016  12/06/2017  Wet Season
2   2011-2012   18/10/2011  17/04/2012  Wet Season
2   2012-2013   24/12/2012  24/06/2013  Wet Season
2   2013-2014   1/01/2014   2/07/2014   Wet Season
2   2014-2015   1/01/2015   2/07/2015   Wet Season
2   2015-2016   23/12/2015  22/06/2016  Wet Season
3   2011-2012   18/10/2011  17/04/2012  Wet Season

I need to add a 'season' column to my ConcFlow dataframe where the value would be determined based on whether the ConcFlow$DATE falls in the ranges defined in First.Flush. If the DATE is within First.Flush$Start and First.Flush$End (inclusive) it needs to be defined as Wet Season. If not, it should be defined as Dry Season.

It also needs to repeat for each ConcFLow$STATION and ConcFLow$Sampling.Year

The best I've been able to produce is a for loop, but I don't know how to make it repeat for each STATION and Sampling.Year. I'm a novice in R and loops don't make too much sense to me yet. I don't know if this is the best approach - any help would be much appreciated thank you!

for (i in seq_len(nrow(First.Flush$STATION))) {
  ConcFlow$Season <- ifelse(is.na(ConcFlow$Season) &
                              ConcFlow$DATE >= First.Flush$Start[i] &
                              ConcFlow$DATE < First.Flush$End[i],
                            First.Flush$Season[i], ConcFlow$Season)                       
}

This is a similar question but I don't know how to apply it to multiple factor levels within my df (ConcFlow$Station and ConFlow$Sampling.Year). categorize based on date ranges in R

Thank you

Ronak Shah · Accepted Answer

You can use fuzzyjoin::fuzzy_left_join to join data based on range.

library(dplyr)

ConcFlow <- ConcFlow %>% mutate(DATE = as.Date(DATE, '%d/%m/%Y'))
First.Flush <- First.Flush %>% mutate(across(c(Start, End), as.Date, '%d/%m/%Y'))

ConcFlow %>%
  fuzzyjoin::fuzzy_left_join(First.Flush, 
          by = c('STATION', 'Sampling.Year', 'DATE' = 'Start', 'DATE' = 'End'), 
          match_fun = c(`==`, `==`, `>=`, `<=`)) %>%
  mutate(Season = replace(Season, is.na(Season), 'Dry Season')) -> result

result

First convert the date columns to date class in both the datasets, find out the rows where STATION and Sampling.Year matches exactly and DATE is in between Start and End, those rows automatically get Season value from First.Flush. The rows which are out of range get NA which can be replaced with 'Dry Season'.

You might need to clean up some column names in result at the end.

categorise df based on date ranges, with multiple factor levels

Answers (1)

Related Questions