Reputation: 4398
I have two datasets, df1 and df2. I would like to merge the rows of df1 with the rows of df2 when their datetimes are within 20 seconds of each other.
df1
Connect Ended
4/6/2020 1:15:21 PM 4/6/2020 2:05:18 PM
3/31/2020 11:00:08 AM 3/31/2020 11:00:10 AM
4/1/2020 10:00:05 PM 4/1/2020 12:00:05 PM
df2
Start End
4/6/2020 1:15:21 PM 4/6/2020 2:05:18 PM
3/31/2020 11:00:10 AM 3/31/2020 11:00:14 AM
Desired Output:
df3
Match_Start1 Match_End1 Match_Start2 Match_End2
4/6/2020 1:15:21 PM 4/6/2020 2:05:18 PM 4/6/2020 1:15:21 PM 4/6/2020 2:05:18 PM
3/31/2020 11:00:08 AM 3/31/2020 11:00:10 AM 3/31/2020 11:00:10 AM 3/31/2020 11:00:14 AM
df4 (unmatched)
Unmatched_Start Unmatched_end
4/1/2020 10:00:05 PM 4/1/2020 12:00:05 PM
Dput:
df1
structure(list(Connect = structure(c(4L, 2L, 3L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "3/31/2020 11:00:08 AM",
"4/1/2020 10:00:05 PM", "4/6/2020 1:15:21 PM"), class = "factor"),
Ended = structure(c(4L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("", "3/31/2020 11:00:10 AM", "4/1/2020 12:00:05 PM",
"4/6/2020 2:05:18 PM"), class = "factor")), class = "data.frame", row.names = c(NA,
-13L))
df2
structure(list(Start = structure(2:1, .Label = c("3/31/2020 11:00:10 AM",
"4/6/2020 1:15:21 PM"), class = "factor"), End = structure(2:1, .Label = c("3/31/2020 11:00:14 AM",
"4/6/2020 2:05:18 PM"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
What I have tried:
pd.merge_asof(df1, df2, on="Connect", "Ended", by='ticker', tolerance=pd.Timedelta('20 s'), direction='backward')
However, how do I incorporate the 20-second condition and also produce the unmatched dataset?
Any suggestions are appreciated.
Upvotes: 0
Reputation: 388817
We can use crossing to create all combinations of the rows of df1 and df2, and keep only the pairs whose start times and end times are each within 20 seconds of each other.
library(tidyr)
library(dplyr)
library(lubridate)
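# Build every df1/df2 row pair, parse the strings as datetimes, and keep
# the pairs whose start and end times are each within 20 seconds.
df3 <- crossing(df1, df2) %>%
  mutate_all(mdy_hms) %>%
  filter(abs(difftime(Connect, Start, units = "secs")) <= 20 &
         abs(difftime(Ended, End, units = "secs")) <= 20)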
df3
# A tibble: 2 x 4
# Connect Ended Start End
# <dttm> <dttm> <dttm> <dttm>
#1 2020-03-31 11:00:08 2020-03-31 11:00:10 2020-03-31 11:00:10 2020-03-31 11:00:14
#2 2020-04-06 13:15:21 2020-04-06 14:05:18 2020-04-06 13:15:21 2020-04-06 14:05:18
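The same cross-join-then-filter idea can also be sketched in base R, in case the tidyverse packages are not available. This is only a sketch under some assumptions: the sample values are retyped as plain character strings rather than taken from the dput, the UTC time zone is arbitrary, the %p (AM/PM) parsing assumes an English locale, and the helper names to_time and within_20s are made up for the example. Note that a cross join creates nrow(df1) * nrow(df2) rows, so it is best suited to small inputs.

```r
# Sample values copied from the question, stored as character strings.
df1 <- data.frame(
  Connect = c("4/6/2020 1:15:21 PM", "3/31/2020 11:00:08 AM", "4/1/2020 10:00:05 PM"),
  Ended   = c("4/6/2020 2:05:18 PM", "3/31/2020 11:00:10 AM", "4/1/2020 12:00:05 PM"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Start = c("4/6/2020 1:15:21 PM", "3/31/2020 11:00:10 AM"),
  End   = c("4/6/2020 2:05:18 PM", "3/31/2020 11:00:14 AM"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: parse month/day/year strings with a 12-hour clock.
# The UTC time zone is an assumption made for this example.
to_time <- function(x) as.POSIXct(x, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
df1[] <- lapply(df1, to_time)
df2[] <- lapply(df2, to_time)

# merge() with by = NULL returns the Cartesian product of the two data frames.
pairs <- merge(df1, df2, by = NULL)

# Hypothetical helper: TRUE where two datetimes differ by at most 20 seconds.
within_20s <- function(a, b) abs(as.numeric(difftime(a, b, units = "secs"))) <= 20

matched <- pairs[within_20s(pairs$Connect, pairs$Start) &
                 within_20s(pairs$Ended, pairs$End), ]
```

Here matched plays the role of df3 above.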
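To get df4, keep the rows of df1 that have no match in df3:
df4 <- df1 %>%
  mutate_all(mdy_hms) %>%
  filter(!is.na(Connect)) %>%  # drop the blank rows in the dput of df1, which parse to NA
  anti_join(df3, by = c('Connect', 'Ended'))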
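If the column names from the desired output in the question are needed, one option is dplyr's rename(). The sketch below uses small hand-built stand-ins for df3 and df4 so that it runs on its own; the ts helper and the UTC time zone are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical helper; the UTC time zone is an assumption.
ts <- function(x) as.POSIXct(x, tz = "UTC")

# Stand-ins for df3 and df4, with values taken from the question.
df3 <- tibble(
  Connect = ts(c("2020-03-31 11:00:08", "2020-04-06 13:15:21")),
  Ended   = ts(c("2020-03-31 11:00:10", "2020-04-06 14:05:18")),
  Start   = ts(c("2020-03-31 11:00:10", "2020-04-06 13:15:21")),
  End     = ts(c("2020-03-31 11:00:14", "2020-04-06 14:05:18"))
)
df4 <- tibble(
  Connect = ts("2020-04-01 22:00:05"),
  Ended   = ts("2020-04-01 12:00:05")
)

# rename() takes new_name = old_name pairs.
df3 <- df3 %>%
  rename(Match_Start1 = Connect, Match_End1 = Ended,
         Match_Start2 = Start, Match_End2 = End)
df4 <- df4 %>%
  rename(Unmatched_Start = Connect, Unmatched_end = Ended)
```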
Upvotes: 1