Reputation: 6874
I want to merge two dataframes on the basis of the unique number and the date matching within +/-7 days
df1
Number Report DateDone
1 some words 13/1/2021
1 more stuff 21/8/2021
44 balbla 11/4/2020
2 gobbledy bla 01/03/2019
44 rara rasputin 13/10/2021
44 tree frogs 11/10/2010
df2
Number Report DateDone
1 hocum poklum 11/1/2021
1 mjimmeny cricket 21/8/2021
44 it wasnt me 11/2/2020
2 its not really 6/03/2019
44 im innocent 12/10/2021
44 bullfrogs 11/01/2010
Number.df1 Report.df1 DateDone.df1 Number.df2 Report.df2 DateDone.df2
1 some words 13/1/2021 1 hocum poklum 11/1/2021
1 more stuff 21/8/2021 1 jimmeny cricket 21/8/2021
2 gobbledy bla 01/03/2019 2 its not really 6/03/2019
44 rara rasputin 13/10/2021 44 im innocent 12/10/2021
I was going to use a sql merge similar to one I found here but I am having difficulty knowing how to merge on the number and a date range. Do I need to calculate the 7 days before and after the DateDone in df1? Surely there is a more efficient way than having to calculate two new columns first?
qry = '''
select
df1.DateDone_start TermStart,
df1.DateDone_end TermEnd,
df2.DateDone df2Start,
df1.Number,
df2.Number
from
df1 join df2 on
date between df1.DateDone_start and df1.DateDone_end join df1 on
df1.Number = df2.Number
'''
df = pd.read_sql_query(qry, conn)
Upvotes: 1
Views: 387
Reputation: 23217
You can use .merge()
on Number
and then filter using .loc
the condition where DateDone.df2
is .between()
DateDone.df1
+/- 7 days by using +/-pd.DateOffset(days=7)
, as follows:
df1['DateDone'] = pd.to_datetime(df1['DateDone'], dayfirst=True)
df2['DateDone'] = pd.to_datetime(df2['DateDone'], dayfirst=True)
df_merge = df1.merge(df2, on='Number', suffixes=('.df1', '.df2'))
result = df_merge.loc[
df_merge['DateDone.df2'].between(
df_merge['DateDone.df1'] - pd.DateOffset(days=7),
df_merge['DateDone.df1'] + pd.DateOffset(days=7))]
Result:
print(result)
Number Report.df1 DateDone.df1 Report.df2 DateDone.df2
0 1 some words 2021-01-13 hocum poklum 2021-01-11
3 1 more stuff 2021-08-21 mjimmeny cricket 2021-08-21
8 44 rara rasputin 2021-10-13 im innocent 2021-10-12
13 2 gobbledy bla 2019-03-01 its not really 2019-03-06
Upvotes: 3
Reputation: 35636
Try merge
then filter out rows that are within 7 days:
new_df = df1.merge(df2, on='Number', suffixes=('.df1', '.df2'))
new_df = new_df[
abs(new_df['DateDone.df1'] - new_df['DateDone.df2']) <= pd.Timedelta(days=7)
]
new_df
:
Number Report.df1 DateDone.df1 Report.df2 DateDone.df2
0 1 some words 2021-01-13 hocum poklum 2021-01-11
3 1 more stuff 2021-08-21 mjimmeny cricket 2021-08-21
8 44 rara rasputin 2021-10-13 im innocent 2021-10-12
13 2 gobbledy bla 2019-03-01 its not really 2019-03-06
Convert to 'DateDone' to DateTime for both Frames if not already done:
df1['DateDone'] = pd.to_datetime(df1['DateDone'], format='%d/%m/%Y')
df2['DateDone'] = pd.to_datetime(df2['DateDone'], format='%d/%m/%Y')
Get the duration between two datetimes
new_df['DateDone.df1'] - new_df['DateDone.df2']
0 2 days
1 -220 days
2 222 days
3 0 days
4 60 days
5 -549 days
6 3743 days
7 610 days
8 1 days
9 4293 days
10 -3410 days
11 -4019 days
12 273 days
13 -5 days
dtype: timedelta64[ns]
Apply abs
to remove directionality from durations and compare to the desired duration:
abs(new_df['DateDone.df1'] - new_df['DateDone.df2']) <= pd.Timedelta(days=7)
Use this index to determine which rows to keep:
0 True
1 False
2 False
3 True
4 False
5 False
6 False
7 False
8 True
9 False
10 False
11 False
12 False
13 True
dtype: bool
Upvotes: 0