Hedgehog
Hedgehog

Reputation: 5657

dplyr: left_join where df A value lies between df B values

I'd like to know if it is possible to achieve the following using dplyr, or some tidyverse package...

Context: I am having trouble getting my data into a structure that will allow the use of geom_rect. See this SO question for the motivation.

library(tis)

# Prepare NBER recession start end dates.
recessions <- data.frame(start = as.Date(as.character(nberDates()[,"Start"]),"%Y%m%d"),
                    end= as.Date(as.character(nberDates()[,"End"]),"%Y%m%d"))

dt <- tibble(date=c(as.Date('1983-01-01'),as.Date('1990-10-15'), as.Date('1993-01-01')))

Desired output:

date       start      end
1983-01-01 NA         NA
1990-10-15 1990-08-01 1991-03-31
1993-01-01 NA         NA

Appreciate any suggestions.

Note: Previous questions indicate that sqldf is one approach to take. However, the data here involves dates and my understanding date is not a data type in SQLite.

In the spirit of 'write the code you wish you had':

df <- dt %>%
      left_join(x=., y=recessions, date >= start & date <= end)

Upvotes: 0

Views: 100

Answers (2)

Hedgehog
Hedgehog

Reputation: 5657

The following uses only dplyr and produces the desired data frame result. Note: On larger datasets you will likely run into memory issues and the sqldf proposed by G. Grothendieck will work.

Hat-tip: @nick-criswell for directing me to @ian-gow for this partial solution

# Build data frame of dates within the interval [start, end]
df1 <- dt %>% 
        mutate(dummy=TRUE) %>% 
        left_join(recessions %>% mutate(dummy=TRUE)) %>% 
        filter(date >= start & date <= end) %>% 
        select(-dummy) 

# Build data frame of all other dates with start=NA and end=NA
df2 <- dt %>% 
        mutate(dummy=TRUE) %>% 
        left_join(recessions %>% mutate(dummy=TRUE)) %>% 
        mutate(start=NA, end=NA) %>%
        unique() %>%
        select(-dummy) 
# Now merge the two.  Overwirte NA values with start and end dates
df <- df2 %>% 
      left_join(x=., y=df1, by="date") %>%
      mutate(date, start = ifelse(is.na(start.y), as.character(start.x), as.character(start.y)),end = ifelse(is.na(end.y), as.character(end.x), as.character(end.y))) %>%
      mutate(start=as.Date(start), end=as.Date(end) )

> df
# A tibble: 3 x 3
        date      start        end
      <date>     <date>     <date>
1 1983-01-01         NA         NA
2 1990-10-15 1990-08-01 1991-03-31
3 1993-01-01         NA         NA

Upvotes: 1

G. Grothendieck
G. Grothendieck

Reputation: 269576

"Date" class objects in R are internally stored as the number of days since the Epoch (January 1, 1970) and that number is what is sent to SQLite so the order is still maintained even though the class is not; therefore, we can do this using the SQLite back end:

sqldf("select * from dt left join recessions on date between start and end")

giving:

        date      start        end
1 1983-01-01       <NA>       <NA>
2 1990-10-15 1990-08-01 1991-03-31
3 1993-01-01       <NA>       <NA>

Also note that sqldf works with several other back ends that do fully support dates so you are not restricted to SQLite. Suggest you review the FAQ and Examples at https://github.com/ggrothendieck/sqldf .

Upvotes: 1

Related Questions