How to drop duplicates based on specific condition regarding how close the dates are using python?

Question

I have the following dataframe:

> df = pd.DataFrame( columns = ['Name','Change Date','Final Date']) 
> df['Name'] = ['Alexandra','Alexandra','Alexandra','Alexandra','Bobby','Bobby']
> df['Change Date'] =['2019-04-12','2019-04-28','2019-05-21','2019-05-30','2019-03-11','2019-03-27'] 
> df['Final Date'] =['2019-04-15','2019-04-15','2019-05-27','2019-05-27','2019-03-20','2019-03-20']

I want to drop all duplicates yet only keep the row whereby the change date is closest to each final date in order to give the following dataframe:

> df = pd.DataFrame( columns = ['Name','Change Date','Final Date']) 
> df['Name'] = ['Alexandra','Alexandra','Bobby'] 
> df['Change Date'] =['2019-04-12','2019-05-30','2019-03-27']
> df['Final Date'] =['2019-04-15','2019-05-27','2019-03-20']

jezrael · Accepted Answer

Convert both columns to datetimes, subtract by Series.sub and get absolute values by Series.abs. Last get indices of minimal values per groups with DataFrameGroupBy.idxmin and select original rows by DataFrame.loc:

df['Final Date'] = pd.to_datetime(df['Final Date'])
df['Change Date'] = pd.to_datetime(df['Change Date'])
df['diff'] = df['Final Date'].sub(df['Change Date']).abs()

df1 = df.loc[df.groupby(['Name','Final Date'])['diff'].idxmin()]
print (df1)
        Name Change Date Final Date   diff
0  Alexandra  2019-04-12 2019-04-15 3 days
3  Alexandra  2019-05-30 2019-05-27 3 days
5      Bobby  2019-03-27 2019-03-20 7 days

If possible duplicated minimal values per groups use:

df1 = df[df.groupby(['Name','Final Date'])['diff'].transform('min').eq(df['diff'])]

Or if need grouping by Name only column and seelct both minimal 3 days values create Series with GroupBy.transform and min and compare by diff, last filter by boolean indexing:

df1 = df[df.groupby('Name')['diff'].transform('min').eq(df['diff'])]
print (df1)
        Name Change Date Final Date   diff
0  Alexandra  2019-04-12 2019-04-15 3 days
3  Alexandra  2019-05-30 2019-05-27 3 days
5      Bobby  2019-03-27 2019-03-20 7 days

How to drop duplicates based on specific condition regarding how close the dates are using python?

Answers (1)

Related Questions