Reputation: 141
I am currently dealing with the following data structures:
Attributes df:
ID Begin_A End_A Interval Value
1 5 1990-03-01 2017-03-10 1990-03-01 UTC--2017-03-10 UTC Cat1
2 10 1993-12-01 2017-12-02 1993-12-01 UTC--2017-12-02 UTC Cat2
3 5 1991-03-01 2017-03-03 1991-03-01 UTC--2017-03-03 UTC Cat3
4 10 1995-12-05 2017-12-10 1995-12-05 UTC--2017-12-10 UTC Cat4
Bookings df:
ID Begin_A End_A Interval
1 5 2017-03-03 2017-03-05 2017-03-03 UTC--2017-03-05 UTC
2 6 2017-05-03 2017-05-05 2017-05-03 UTC--2017-05-05 UTC
3 8 2017-03-03 2017-03-05 2017-03-03 UTC--2017-03-05 UTC
4 10 2017-12-05 2017-12-06 2017-12-05 UTC--2017-12-06 UTC
As already mentioned in the following post: Matching values conditioned on overlapping Intervals and ID , I intend to do the following data-restructuring: Take the ID from bookings, filter all rows of the attributes data frame where attributes ID matches the booking ID. Check which of the rows with matching attribute ID also have overlapping time intervals (int_overlaps from lubridate). Then take the respective value from the Value column and print each of them in the Attribute_value column.
The intended result would look like this:
ID Begin_A End_A Interval Attribute_value
5 2017-03-03 2017-03-05 2017-03-03 UTC--2017-03-05 UTC Cat1,Cat3
6 2017-05-03 2017-05-05 2017-05-03 UTC--2017-05-05 UTC NA
8 2017-03-03 2017-03-05 2017-03-03 UTC--2017-03-05 UTC NA
10 2017-12-05 2017-12-06 2017-12-05 UTC--2017-12-06 UTC Cat4
ycw already provided a partial answer to this question here:(https://stackoverflow.com/a/46819541/8259308). This solution does not allow long periods between Begin_A and End_A in the attributes data frame, because a vector with individual dates is created with this command:
complete(Date = full_seq(Date, period = 1), ID) %>%
Since my original dataset has a very large amount of observations with long time frames in the Attributes data frame, R is not capable of processing these large amount of observations. My idea was to either modify the above mentioned line to reduce the jumps in dates to months ( which would also diminish the precision) or to try a new approach. The following code produces the data frames presented above:
library(lubridate)
library(tidyverse)
# Attributes data frame:
date1 <- as.Date(c('1990-3-1','1993-12-1','1991-3-1','1995-12-5'))
date2 <- as.Date(c('2017-3-10','2017-12-2','2017-3-3','2017-12-10'))
attributes <- data.frame(matrix(NA,nrow=4, ncol = 5))
names(attributes) <- c("ID","Begin_A", "End_A", "Interval", "Value")
attributes$ID <- as.numeric(c(5,10,5,10))
attributes$Begin_A <-date1
attributes$End_A <-date2
attributes$Interval <-attributes$Begin_A %--% attributes$End_A
attributes$Value<- as.character(c("Cat1","Cat2","Cat3","Cat4"))
### Bookings data frame:
date1 <- as.Date(c('2017-3-3','2017-5-3','2017-3-3','2017-12-5'))
date2 <- as.Date(c('2017-3-5','2017-5-5','2017-3-5','2017-12-6'))
bookings <- data.frame(matrix(NA,nrow=4, ncol = 4))
names(bookings) <- c("ID","Begin_A", "End_A", "Interval")
bookings$ID <- as.numeric(c(5,6,8,10))
bookings$Begin_A <-date1
bookings$End_A <-date2
bookings$Interval <-bookings$Begin_A %--% bookings$End_A
This is the solution for the previous post provided by ycw:
library(tidyverse)
attributes2 <- attributes %>%
select(-Interval) %>%
gather(Type, Date, ends_with("_A")) %>%
select(-Type) %>%
group_by(Value) %>%
complete(Date = full_seq(Date, period = 1), ID) %>%
ungroup()
bookings2 <- bookings %>%
select(-Interval) %>%
gather(Type, Date, ends_with("_A")) %>%
select(-Type) %>%
group_by(ID) %>%
complete(Date = full_seq(Date, period = 1)) %>%
ungroup()
bookings3 <- bookings2 %>%
left_join(attributes2, by = c("ID", "Date")) %>%
group_by(ID) %>%
summarise(Attribute_value = toString(sort(unique(Value)))) %>%
mutate(Attribute_value = ifelse(Attribute_value %in% "", NA, Attribute_value))
bookings4 <- bookings %>% left_join(bookings3, by = "ID")
bookings4
ID Begin_A End_A Interval Attribute_value
1 5 2017-03-03 2017-03-05 2017-03-03 UTC--2017-03-05 UTC Cat1, Cat3
2 6 2017-05-03 2017-05-05 2017-05-03 UTC--2017-05-05 UTC <NA>
3 8 2017-03-03 2017-03-05 2017-03-03 UTC--2017-03-05 UTC <NA>
4 10 2017-12-05 2017-12-06 2017-12-05 UTC--2017-12-06 UTC Cat4
Upvotes: 1
Views: 287
Reputation: 67778
You may consider data.table
which allows for "non-equi joins", i.e. joins based on >=
, >
, <=
and <
. In the same call, aggregate operations may be performed on the groups in the LHS data set that each row in the RHS data set (i
) matches (by = .EACHI
).
d1[d2, on = .(id = id, end >= begin),
.(i.begin, i.end, val_str = toString(val)), by = .EACHI]
# id end i.begin i.end val_str
# 1: 5 2017-03-03 2017-03-03 2017-03-05 Cat3, Cat1
# 2: 6 2017-05-03 2017-05-03 2017-05-05 NA
# 3: 8 2017-03-03 2017-03-03 2017-03-05 NA
# 4: 10 2017-12-05 2017-12-05 2017-12-06 Cat4
Data preparation:
d1 <- data.frame(id = c(5, 10, 5, 10),
begin = as.Date(c('1990-3-1','1993-12-1','1991-3-1','1995-12-5')),
end = as.Date(c('2017-3-10','2017-12-2','2017-3-3','2017-12-10')),
val = c("Cat1", "Cat2", "Cat3", "Cat4"))
d2 <- data.frame(id = c(5, 6, 8, 10),
begin = as.Date(c('2017-3-3','2017-5-3','2017-3-3','2017-12-5')),
end = as.Date(c('2017-3-5','2017-5-5','2017-3-5','2017-12-6')))
library(data.table)
setDT(d1)
setDT(d2)
Upvotes: 1