Left join DataFrame where the Date in the left DataFrame is contained in the range of Dates based around a Date in the right DataFrame

Question

import pandas as pd
df_A = pd.DataFrame({'Team_A': ['Cowboys', 'Giants'], 'Team_B': ['Eagles', 'Redskins'], 'Start':['2017-11-09','2017-09-10']})
df_B = pd.DataFrame({'Team_A': ['Cowboys', 'Cowboys', 'Giants'], 'Team_B': ['Eagles', 'Eagles','Redskins'], 'Start':['2017-11-09','2017-11-11','2017-09-10']})

df_A['Start'] = pd.to_datetime(df_A.Start)
df_B['Start'] = pd.to_datetime(df_B.Start)

I want to left join on df A. The trouble is that the games may be repeated in df_B usually with a slightly different date, no more than +- 4 days from the correct date (the one listed in df A). In the example shown the first game in df A is shown twice: first with the correct date, the second time with an incorrect date. It it not necessarily the case that the first date will be the correct one. It is also possible that more that one incorrect dates may be shown so a game may appear more than twice. Please note also that the example above is simplified in the actual problem there are several other columns which may or may not match. The other key point is that these teams will appear again several times in the real problem but at dates much further that +- 4 days.

df_merge = pd.merge(df_A, df_B, on=['Team_A', 'Team_B', 'Start'], how='left')

This is close to what I want but only gives the games where the Start dates match exactly. I also want the games that are within +- 4 days of the Start date.

Merging two dataframes based on a date between two other dates without a common column

This tackles a similar problem but in my case the number of rows in each DataFrame are different so it won't work for me.

I also tried this one but could not get it to work for me:

How to join two table in pandas based on time with delay

I also tried:

a = df_A['Start'] - pd.Timedelta(4, unit='d')
b = df_A['Start'] + pd.Timedelta(4, unit='d')
df = db_B[db_B['Start'].between(a, b, inclusive=False)]

but again this does not work because of the differing number of rows in each DataFrame.

rpanai · Accepted Answer

IIUC you would rather use outer merge as in the following example

import pandas as pd
df_A = pd.DataFrame({'Team_A': ['Cowboys', 'Giants'], 'Team_B': ['Eagles', 'Redskins'], 'Start':['2017-11-09','2017-09-10']})
df_B = pd.DataFrame({'Team_A': ['Cowboys', 'Cowboys', 'Giants'], 'Team_B': ['Eagles', 'Eagles','Redskins'], 'Start':['2017-11-09','2017-11-11','2017-09-10']})

df_A['Start'] = pd.to_datetime(df_A.Start)
df_B['Start'] = pd.to_datetime(df_B.Start)
# +/- 4 days
df_A["lower"] = df_A["Start"]- pd.Timedelta(4, unit='d')
df_A["upper"] = df_A["Start"] + pd.Timedelta(4, unit='d')
# Get rid of Start col
df_A = df_A.drop("Start", axis=1)
# outer merge on Team_A, Team_B only
df = pd.merge(df_A, df_B, on=['Team_A', 'Team_B'], how='outer')
# filter
df = df[df["Start"].between(df["lower"], df["upper"])].reset_index(drop=True)

If your dataframe is huge you might consider using dask.

Left join DataFrame where the Date in the left DataFrame is contained in the range of Dates based around a Date in the right DataFrame

Answers (1)

Related Questions