Reputation: 143
I have the following dataframe:
> df = pd.DataFrame( columns = ['Name','Change Date','Final Date'])
> df['Name'] = ['Alexandra','Alexandra','Alexandra','Alexandra','Bobby','Bobby']
> df['Change Date'] =['2019-04-12','2019-04-28','2019-05-21','2019-05-30','2019-03-11','2019-03-27']
> df['Final Date'] =['2019-04-15','2019-04-15','2019-05-27','2019-05-27','2019-03-20','2019-03-20']
I want to drop all duplicates yet only keep the row whereby the change date is closest to each final date in order to give the following dataframe:
> df = pd.DataFrame( columns = ['Name','Change Date','Final Date'])
> df['Name'] = ['Alexandra','Alexandra','Bobby']
> df['Change Date'] =['2019-04-12','2019-05-30','2019-03-27']
> df['Final Date'] =['2019-04-15','2019-05-27','2019-03-20']
Upvotes: 1
Views: 254
Reputation: 862581
Convert both columns to datetimes, subtract by Series.sub
and get absolute values by Series.abs
. Last get indices of minimal values per groups with DataFrameGroupBy.idxmin
and select original rows by DataFrame.loc
:
df['Final Date'] = pd.to_datetime(df['Final Date'])
df['Change Date'] = pd.to_datetime(df['Change Date'])
df['diff'] = df['Final Date'].sub(df['Change Date']).abs()
df1 = df.loc[df.groupby(['Name','Final Date'])['diff'].idxmin()]
print (df1)
Name Change Date Final Date diff
0 Alexandra 2019-04-12 2019-04-15 3 days
3 Alexandra 2019-05-30 2019-05-27 3 days
5 Bobby 2019-03-27 2019-03-20 7 days
If possible duplicated minimal values per groups use:
df1 = df[df.groupby(['Name','Final Date'])['diff'].transform('min').eq(df['diff'])]
Or if need grouping by Name
only column and seelct both minimal 3 days
values create Series with GroupBy.transform
and min
and compare by diff
, last filter by boolean indexing
:
df1 = df[df.groupby('Name')['diff'].transform('min').eq(df['diff'])]
print (df1)
Name Change Date Final Date diff
0 Alexandra 2019-04-12 2019-04-15 3 days
3 Alexandra 2019-05-30 2019-05-27 3 days
5 Bobby 2019-03-27 2019-03-20 7 days
Upvotes: 1