Reputation: 71
How might I categorize each row in an R dataframe (>10 000 rows) based on date range definitions in a separate, much smaller R dataframe (62 rows)?
My large dataframe, ConcFlow, looks similar to this when called via head(ConcFlow) :
STATION Sampling.Year DATE Flow
1 2016-2017 13/03/2017 177.45
1 2016-2017 12/01/2017 96.798
1 2016-2017 11/01/2017 99.902
2 2016-2017 4/03/2017 109.74
2 2016-2017 5/03/2017 100.55
3 2016-2017 19/05/2017 2302.5
1 2017-2018 13/03/2018 177.45
1 2017-2018 12/01/2018 96.798
1 2017-2018 11/01/2018 99.902
2 2017-2018 4/03/2018 109.74
2 2017-2018 5/03/2018 100.55
3 2017-2018 19/05/2018 2302.5
The smaller dataset, First.Flush, contains dates for the start and end of the Australian wet season, like this:
STATION Sampling.Year Start End Season
1 2011-2012 1/01/2012 1/07/2012 Wet Season
1 2013-2014 1/01/2014 2/07/2014 Wet Season
1 2014-2015 1/01/2015 2/07/2015 Wet Season
1 2015-2016 23/12/2015 22/06/2016 Wet Season
1 2016-2017 12/12/2016 12/06/2017 Wet Season
2 2011-2012 18/10/2011 17/04/2012 Wet Season
2 2012-2013 24/12/2012 24/06/2013 Wet Season
2 2013-2014 1/01/2014 2/07/2014 Wet Season
2 2014-2015 1/01/2015 2/07/2015 Wet Season
2 2015-2016 23/12/2015 22/06/2016 Wet Season
3 2011-2012 18/10/2011 17/04/2012 Wet Season
I need to add a 'season' column to my ConcFlow dataframe where the value would be determined based on whether the ConcFlow$DATE falls in the ranges defined in First.Flush. If the DATE is within First.Flush$Start and First.Flush$End (inclusive) it needs to be defined as Wet Season. If not, it should be defined as Dry Season.
It also needs to repeat for each ConcFLow$STATION and ConcFLow$Sampling.Year
The best I've been able to produce is a for loop, but I don't know how to make it repeat for each STATION and Sampling.Year. I'm a novice in R and loops don't make too much sense to me yet. I don't know if this is the best approach - any help would be much appreciated thank you!
for (i in seq_len(nrow(First.Flush$STATION))) {
ConcFlow$Season <- ifelse(is.na(ConcFlow$Season) &
ConcFlow$DATE >= First.Flush$Start[i] &
ConcFlow$DATE < First.Flush$End[i],
First.Flush$Season[i], ConcFlow$Season)
}
This is a similar question but I don't know how to apply it to multiple factor levels within my df (ConcFlow$Station and ConFlow$Sampling.Year). categorize based on date ranges in R
Thank you
Upvotes: 0
Views: 114
Reputation: 389135
You can use fuzzyjoin::fuzzy_left_join
to join data based on range.
library(dplyr)
ConcFlow <- ConcFlow %>% mutate(DATE = as.Date(DATE, '%d/%m/%Y'))
First.Flush <- First.Flush %>% mutate(across(c(Start, End), as.Date, '%d/%m/%Y'))
ConcFlow %>%
fuzzyjoin::fuzzy_left_join(First.Flush,
by = c('STATION', 'Sampling.Year', 'DATE' = 'Start', 'DATE' = 'End'),
match_fun = c(`==`, `==`, `>=`, `<=`)) %>%
mutate(Season = replace(Season, is.na(Season), 'Dry Season')) -> result
result
First convert the date columns to date class in both the datasets, find out the rows where STATION
and Sampling.Year
matches exactly and DATE
is in between Start
and End
, those rows automatically get Season
value from First.Flush
. The rows which are out of range get NA
which can be replaced with 'Dry Season'.
You might need to clean up some column names in result
at the end.
Upvotes: 1