Joshua Zecha
Joshua Zecha

Reputation: 141

Matching values between data frames based on overlapping dates

I am currently dealing with the following data structures:

Attributes df:

  ID Begin_A      End_A        Interval                          Value
1  5 1990-03-01   2017-03-10   1990-03-01 UTC--2017-03-10 UTC    Cat1
2 10 1993-12-01   2017-12-02   1993-12-01 UTC--2017-12-02 UTC    Cat2
3  5 1991-03-01   2017-03-03   1991-03-01 UTC--2017-03-03 UTC    Cat3
4 10 1995-12-05   2017-12-10   1995-12-05 UTC--2017-12-10 UTC    Cat4

Bookings df:

  ID    Begin_A      End_A                       Interval
1  5 2017-03-03 2017-03-05 2017-03-03 UTC--2017-03-05 UTC
2  6 2017-05-03 2017-05-05 2017-05-03 UTC--2017-05-05 UTC
3  8 2017-03-03 2017-03-05 2017-03-03 UTC--2017-03-05 UTC
4 10 2017-12-05 2017-12-06 2017-12-05 UTC--2017-12-06 UTC

As already mentioned in the following post: Matching values conditioned on overlapping Intervals and ID , I intend to do the following data-restructuring: Take the ID from bookings, filter all rows of the attributes data frame where attributes ID matches the booking ID. Check which of the rows with matching attribute ID also have overlapping time intervals (int_overlaps from lubridate). Then take the respective value from the Value column and print each of them in the Attribute_value column.

The intended result would look like this:

ID  Begin_A      End_A        Interval                        Attribute_value
5   2017-03-03   2017-03-05   2017-03-03 UTC--2017-03-05 UTC  Cat1,Cat3
6   2017-05-03   2017-05-05   2017-05-03 UTC--2017-05-05 UTC  NA
8   2017-03-03   2017-03-05   2017-03-03 UTC--2017-03-05 UTC  NA
10  2017-12-05   2017-12-06   2017-12-05 UTC--2017-12-06 UTC  Cat4

ycw already provided a partial answer to this question here:(https://stackoverflow.com/a/46819541/8259308). This solution does not allow long periods between Begin_A and End_A in the attributes data frame, because a vector with individual dates is created with this command:

complete(Date = full_seq(Date, period = 1), ID) %>%

Since my original dataset has a very large amount of observations with long time frames in the Attributes data frame, R is not capable of processing these large amount of observations. My idea was to either modify the above mentioned line to reduce the jumps in dates to months ( which would also diminish the precision) or to try a new approach. The following code produces the data frames presented above:

library(lubridate)
library(tidyverse)
# Attributes data frame:
date1 <- as.Date(c('1990-3-1','1993-12-1','1991-3-1','1995-12-5'))
date2 <- as.Date(c('2017-3-10','2017-12-2','2017-3-3','2017-12-10'))
attributes <- data.frame(matrix(NA,nrow=4, ncol = 5)) 
names(attributes) <- c("ID","Begin_A", "End_A", "Interval", "Value")
attributes$ID <- as.numeric(c(5,10,5,10))
attributes$Begin_A <-date1
attributes$End_A <-date2
attributes$Interval <-attributes$Begin_A %--% attributes$End_A
attributes$Value<- as.character(c("Cat1","Cat2","Cat3","Cat4"))

### Bookings data frame:

date1 <- as.Date(c('2017-3-3','2017-5-3','2017-3-3','2017-12-5'))
date2 <- as.Date(c('2017-3-5','2017-5-5','2017-3-5','2017-12-6'))
bookings <- data.frame(matrix(NA,nrow=4, ncol = 4)) 
names(bookings) <- c("ID","Begin_A", "End_A", "Interval")
bookings$ID <- as.numeric(c(5,6,8,10))
bookings$Begin_A <-date1
bookings$End_A <-date2
bookings$Interval <-bookings$Begin_A %--% bookings$End_A

This is the solution for the previous post provided by ycw:

library(tidyverse)

attributes2 <- attributes %>%
  select(-Interval) %>%
  gather(Type, Date, ends_with("_A")) %>%
  select(-Type) %>%
  group_by(Value) %>%
  complete(Date = full_seq(Date, period = 1), ID) %>%
  ungroup()

bookings2 <- bookings %>%
  select(-Interval) %>%
  gather(Type, Date, ends_with("_A")) %>%
  select(-Type) %>%
  group_by(ID) %>%
  complete(Date = full_seq(Date, period = 1)) %>%
  ungroup()

bookings3 <- bookings2 %>%
  left_join(attributes2, by = c("ID", "Date")) %>%
  group_by(ID) %>%
  summarise(Attribute_value = toString(sort(unique(Value)))) %>%
  mutate(Attribute_value = ifelse(Attribute_value %in% "", NA, Attribute_value))

bookings4 <- bookings %>% left_join(bookings3, by = "ID")
bookings4
  ID    Begin_A      End_A                       Interval Attribute_value
1  5 2017-03-03 2017-03-05 2017-03-03 UTC--2017-03-05 UTC      Cat1, Cat3
2  6 2017-05-03 2017-05-05 2017-05-03 UTC--2017-05-05 UTC            <NA>
3  8 2017-03-03 2017-03-05 2017-03-03 UTC--2017-03-05 UTC            <NA>
4 10 2017-12-05 2017-12-06 2017-12-05 UTC--2017-12-06 UTC            Cat4

Upvotes: 1

Views: 287

Answers (1)

Henrik
Henrik

Reputation: 67778

You may consider data.table which allows for "non-equi joins", i.e. joins based on >=, >, <= and <. In the same call, aggregate operations may be performed on the groups in the LHS data set that each row in the RHS data set (i) matches (by = .EACHI).

d1[d2, on = .(id = id, end >= begin),
         .(i.begin, i.end, val_str = toString(val)), by = .EACHI]

#    id        end    i.begin      i.end    val_str
# 1:  5 2017-03-03 2017-03-03 2017-03-05 Cat3, Cat1
# 2:  6 2017-05-03 2017-05-03 2017-05-05         NA
# 3:  8 2017-03-03 2017-03-03 2017-03-05         NA
# 4: 10 2017-12-05 2017-12-05 2017-12-06       Cat4

Data preparation:

d1 <- data.frame(id = c(5, 10, 5, 10),
                 begin = as.Date(c('1990-3-1','1993-12-1','1991-3-1','1995-12-5')),
                 end = as.Date(c('2017-3-10','2017-12-2','2017-3-3','2017-12-10')),
                 val = c("Cat1", "Cat2", "Cat3", "Cat4"))

d2 <- data.frame(id = c(5, 6, 8, 10),
                 begin = as.Date(c('2017-3-3','2017-5-3','2017-3-3','2017-12-5')),
                 end = as.Date(c('2017-3-5','2017-5-5','2017-3-5','2017-12-6')))

library(data.table)
setDT(d1)
setDT(d2)

Upvotes: 1

Related Questions