Lynn

Reputation: 4398

Merge two datasets with conditional datetimes and show unmatched values

I have two datasets, df1 and df2. I would like to merge the columns of df1 with those of df2 wherever the datetimes in df1 are within 20 seconds of the corresponding datetimes in df2.

df1


Connect                 Ended

4/6/2020 1:15:21 PM     4/6/2020 2:05:18 PM
3/31/2020 11:00:08 AM   3/31/2020 11:00:10 AM
4/1/2020 10:00:05 PM    4/1/2020 12:00:05 PM

df2

Start                   End

4/6/2020 1:15:21 PM     4/6/2020 2:05:18 PM
3/31/2020 11:00:10 AM   3/31/2020 11:00:14 AM

Desired Output:

df3

Match_Start1            Match_End1              Match_Start2            Match_End2

4/6/2020 1:15:21 PM     4/6/2020 2:05:18 PM     4/6/2020 1:15:21 PM     4/6/2020 2:05:18 PM
3/31/2020 11:00:08 AM   3/31/2020 11:00:10 AM   3/31/2020 11:00:10 AM   3/31/2020 11:00:14 AM

df4 (unmatched)

Unmatched_Start         Unmatched_End

4/1/2020 10:00:05 PM    4/1/2020 12:00:05 PM

Dput:

df1

structure(list(Connect = structure(c(4L, 2L, 3L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "3/31/2020 11:00:08 AM", 
"4/1/2020 10:00:05 PM", "4/6/2020 1:15:21 PM"), class = "factor"), 
Ended = structure(c(4L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = c("", "3/31/2020 11:00:10 AM", "4/1/2020 12:00:05 PM", 
"4/6/2020 2:05:18 PM"), class = "factor")), class = "data.frame", row.names = c(NA, 
 -13L))



df2

structure(list(Start = structure(2:1, .Label = c("3/31/2020 11:00:10 AM", 
"4/6/2020 1:15:21 PM"), class = "factor"), End = structure(2:1, .Label = c("3/31/2020 11:00:14 AM", 
"4/6/2020 2:05:18 PM"), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

What I have tried:

pd.merge_asof(df1, df2, on="Connect", "Ended", by='ticker',       tolerance=pd.Timedelta('20 s'), direction='backward')

However, how do I incorporate the 20-second condition and also show the unmatched rows as a separate dataset?

Any suggestions are appreciated.

Upvotes: 0

Views: 57

Answers (1)

Ronak Shah

Reputation: 388817

We can use crossing() to create all combinations of rows from df1 and df2, then keep only the combinations where both the start times and the end times are within 20 seconds of each other.

library(tidyr)
library(dplyr)
library(lubridate)

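# Pair every distinct df1 row with every df2 row, parse all four columns as
# date-times, then keep pairs whose start and end times both differ by <= 20 s.
df3 <- crossing(df1, df2) %>%
          mutate_all(mdy_hms) %>%
          filter(abs(difftime(Connect, Start, units = "secs")) <= 20 &
                 abs(difftime(Ended, End, units = "secs")) <= 20)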

df3
# A tibble: 2 x 4
#  Connect             Ended               Start               End                
#  <dttm>              <dttm>              <dttm>              <dttm>             
#1 2020-03-31 11:00:08 2020-03-31 11:00:10 2020-03-31 11:00:10 2020-03-31 11:00:14
#2 2020-04-06 13:15:21 2020-04-06 14:05:18 2020-04-06 13:15:21 2020-04-06 14:05:18

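To get df4, keep the rows of df1 that have no match in df3:

# filter() drops the blank rows in df1, which mdy_hms() parses to NA
df4 <- df1 %>% mutate_all(mdy_hms) %>% filter(!is.na(Connect)) %>% anti_join(df3, by = c('Connect', 'Ended'))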
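
Since the attempt in the question used pandas, the same cross-join-then-filter idea can probably be sketched there as well. This is only a rough illustration under a few assumptions: the sample rows are retyped from the tables in the question (without the blank rows from the dput), `FMT` and `tol` are names made up here, and `how="cross"` needs pandas 1.2 or later. `pd.merge_asof` matches on a single sorted key, so a plain cross join is used instead to check both the start and end columns.

```python
import pandas as pd

# Assumed timestamp layout, e.g. "4/6/2020 1:15:21 PM"
FMT = "%m/%d/%Y %I:%M:%S %p"

# Sample rows copied from the question's tables (illustrative only)
df1 = pd.DataFrame({
    "Connect": ["4/6/2020 1:15:21 PM", "3/31/2020 11:00:08 AM", "4/1/2020 10:00:05 PM"],
    "Ended":   ["4/6/2020 2:05:18 PM", "3/31/2020 11:00:10 AM", "4/1/2020 12:00:05 PM"],
})
df2 = pd.DataFrame({
    "Start": ["4/6/2020 1:15:21 PM", "3/31/2020 11:00:10 AM"],
    "End":   ["4/6/2020 2:05:18 PM", "3/31/2020 11:00:14 AM"],
})

# Parse every column from text into datetime values
df1 = df1.apply(pd.to_datetime, format=FMT)
df2 = df2.apply(pd.to_datetime, format=FMT)

tol = pd.Timedelta(seconds=20)

# Cross join: every df1 row paired with every df2 row
pairs = df1.merge(df2, how="cross")

# Keep pairs whose start times AND end times are both within the tolerance
close = ((pairs["Connect"] - pairs["Start"]).abs() <= tol) & (
    (pairs["Ended"] - pairs["End"]).abs() <= tol
)
df3 = pairs[close].reset_index(drop=True)

# Anti-join: df1 rows whose (Connect, Ended) pair never appears in df3
flagged = df1.merge(
    df3[["Connect", "Ended"]].drop_duplicates(),
    on=["Connect", "Ended"],
    how="left",
    indicator=True,
)
df4 = flagged.loc[flagged["_merge"] == "left_only", ["Connect", "Ended"]]
df4 = df4.reset_index(drop=True)

print(df3)
print(df4)
```

A cross join grows with the product of the two row counts, so this sketch suits small tables like the ones in the question; for large inputs a sorted, single-key approach such as `pd.merge_asof` with `tolerance` may be worth exploring instead.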

Upvotes: 1
