herdnerd
herdnerd

Reputation: 23

categorize based on date ranges in R

How do I categorize each row in a large R dataframe (>2 million rows) based on date range definitions in a separate, much smaller R dataframe (12 rows)?

My large dataframe, captures, looks similar to this when called via head(captures) :

       id       date sex
1  160520 2016-11-22   1
2 1029735 2016-11-12   1
3 1885200 2016-11-05   1
4 2058366 2015-09-26   2
5 2058367 2015-09-26   1
6 2058368 2015-09-26   1

My small dataframe, seasons, looks similar to this in its entirety:

Season Opening.Date Closing.Date
  2016   2016-09-24   2017-01-15
  2015   2015-09-26   2016-01-10
  2014   2014-09-27   2015-01-11
  2013   2013-09-28   2014-01-12
  2012   2012-09-22   2013-01-13
  2011   2011-09-24   2012-01-08
  2010   2010-09-25   2011-01-16
  2009   2009-09-26   2010-01-17
  2008   2008-09-27   2009-01-18
  2007   2007-09-22   2008-01-13
  2006   2006-09-23   2007-01-14
  2005   2005-09-24   2006-01-15 

I need to add a 'season' column to my captures dataframe where the value would be determined based on if and where captures$date falls in the ranges defined in seasons.

Here is a long-hand solution I came up with that isn't working for me because my dataframe is so large.

#add packages
library(dplyr)
library(lubridate)
#make blank column
captures$season=NA
for (i in 1:length(seasons$Season)){
  for (j in 1:length(captures$id{
    captures$season[j]=ifelse(between(captures$date[j],ymd(seasons$Opening.Date[i]),ymd(seasons$Closing.Date[i])),seasons$Season[i],captures$season[j])
  }
}

Again, this doesn't work for me as R crashes every time. I also realize this doesn't take advantage of vectorization in R. Any help here is appreciated!

Upvotes: 1

Views: 1956

Answers (3)

Arun
Arun

Reputation: 118839

Here's using non equi joins from data.table:

require(data.table) # v1.10.4+
setDT(captures) # convert data.frames to data.tables
setDT(seasons)

ans <- seasons[captures, Season,
                 on=.(Opening.Date<=date, Closing.Date>=date), 
                 mult="first"]
# [1] 2016 2016 2016 2015 2015 2015
seasons[, season := ans]

For each row in captures, the index corresponding to the first matching row (mult="first") in seasons is figured out based on the condition provided to on argument. The value of Season for corresponding indices is then returned and saved under ans. It is then added as a new column to seasons by reference.

I've shown it in two steps for sake of understanding.


You can see the first matching indices by using which=TRUE instead:

seasons[captures, 
          on=.(Opening.Date<=date, Closing.Date>=date),
          mult="first", 
          which=TRUE]
# [1] 1 1 1 2 2 2

Upvotes: 2

Pierre Lapointe
Pierre Lapointe

Reputation: 16277

You could try with sqldf. Note, I had to change the point in Opening_Date and Closing_Date to an "_".

library(sqldf)

captures$season <- sqldf("select Season from seasons s, captures c
where c.date >= s.Opening_Date and c.date <= s.Closing_Date")
captures    
       id       date sex Season
1  160520 2016-11-22   1   2016
2 1029735 2016-11-12   1   2016
3 1885200 2016-11-05   1   2016
4 2058366 2015-09-26   2   2015
5 2058367 2015-09-26   1   2015
6 2058368 2015-09-26   1   2015

data

txt <- "Season Opening_Date Closing_Date
  2016   2016-09-24   2017-01-15
  2015   2015-09-26   2016-01-10
  2014   2014-09-27   2015-01-11
  2013   2013-09-28   2014-01-12
  2012   2012-09-22   2013-01-13
  2011   2011-09-24   2012-01-08
  2010   2010-09-25   2011-01-16
  2009   2009-09-26   2010-01-17
  2008   2008-09-27   2009-01-18
  2007   2007-09-22   2008-01-13
  2006   2006-09-23   2007-01-14
  2005   2005-09-24   2006-01-15"
seasons <- read.table(text = txt, header = TRUE)
seasons[2:3] <- lapply(seasons[2:3], as.Date)

txt <- "       id       date sex
1  160520 2016-11-22   1
2 1029735 2016-11-12   1
3 1885200 2016-11-05   1
4 2058366 2015-09-26   2
5 2058367 2015-09-26   1
6 2058368 2015-09-26   1"
captures <- read.table(text = txt, header = TRUE)
captures$date <- as.Date(captures$date)

Upvotes: 0

r2evans
r2evans

Reputation: 160607

It would be great indeed if you could do a join operation efficiently based on a range of values instead of equality. Unfortunately, I don't know if a general solution exists. In the time being, I suggest using a single for loop.

The efficiency of vectorization is best done along the tallest data. That is, if we loop on one data.frame and vectorize the other, it makes more sense to vectorize the longer vector and loop on the shorter ones. With this in mind, we'll loop on the frame of seasons and vectorize the 2M rows of data.

Your data:

txt <- "Season Opening.Date Closing.Date
  2016   2016-09-24   2017-01-15
  2015   2015-09-26   2016-01-10
  2014   2014-09-27   2015-01-11
  2013   2013-09-28   2014-01-12
  2012   2012-09-22   2013-01-13
  2011   2011-09-24   2012-01-08
  2010   2010-09-25   2011-01-16
  2009   2009-09-26   2010-01-17
  2008   2008-09-27   2009-01-18
  2007   2007-09-22   2008-01-13
  2006   2006-09-23   2007-01-14
  2005   2005-09-24   2006-01-15"
seasons <- read.table(text = txt, header = TRUE)
seasons[2:3] <- lapply(seasons[2:3], as.Date)

txt <- "       id       date sex
1  160520 2016-11-22   1
2 1029735 2016-11-12   1
3 1885200 2016-11-05   1
4 2058366 2015-09-26   2
5 2058367 2015-09-26   1
6 2058368 2015-09-26   1"
dat <- read.table(text = txt, header = TRUE)
dat$date <- as.Date(dat$date)

And the start the process, we assume that all data's season is as yet not defined:

dat$season <- NA

Loop around each of the seasons' rows:

for (i in seq_len(nrow(seasons))) {
  dat$season <- ifelse(is.na(dat$season) &
                         dat$date >= seasons$Opening.Date[i] &
                         dat$date < seasons$Closing.Date[i],
                       seasons$Season[i], dat$season)                       
}
dat
#        id       date sex season
# 1  160520 2016-11-22   1   2016
# 2 1029735 2016-11-12   1   2016
# 3 1885200 2016-11-05   1   2016
# 4 2058366 2015-09-26   2   2015
# 5 2058367 2015-09-26   1   2015
# 6 2058368 2015-09-26   1   2015

Upvotes: 0

Related Questions