Remove rows based on groupby of multiple columns resulting in lowest value only

Question

I have the following data in a pandas dataframe in Python. I would like to group the data based on the Name and Property column, and keep only the lowest value within the Date.

Before:

  Name  Property                         Date
0   X1      Cash  621 days 00:00:00.000000000
1   X1      Cash  256 days 00:00:00.000000000
2   X1      Cash  101 days 00:00:00.000000000
3   X1  Overflow  352 days 00:00:00.000000000
4   X1  Overflow  101 days 00:00:00.000000000
5   X1  Overflow   53 days 00:00:00.000000000

Repeat for X2/X3 etc.

After:

  Name  Property                         Date
2   X1      Cash  101 days 00:00:00.000000000
5   X1  Overflow   53 days 00:00:00.000000000

I am attempting to play around with the syntax but unable to get it right:

df1 = df1[df1.groupby(['Name', 'Property'])['Date'].min()]

user2285236 · Accepted Answer

If you call reset intex on the groupby.min() result, you'll have your desired output:

df.groupby(['Name', 'Property'])['Date'].min().reset_index()
Out: 
  Name  Property     Date
0   X1      Cash 101 days
1   X1  Overflow  53 days

You can also use nsmallest:

df.groupby(['Name', 'Property'])['Date'].nsmallest(1)
Out: 
Name  Property   
X1    Cash      2   101 days
      Overflow  5    53 days
Name: Date, dtype: timedelta64[ns]

You can reset the index and drop the original index if you want:

df.groupby(['Name', 'Property'])['Date'].nsmallest(1).reset_index(level=[0, 1])
Out: 
  Name  Property     Date
2   X1      Cash 101 days
5   X1  Overflow  53 days

If you have additional columns you want to keep, you have several options.

df['additional column'] = np.arange(6)

df
Out: 
  Name  Property     Date  additional column
0   X1      Cash 621 days                  0
1   X1      Cash 256 days                  1
2   X1      Cash 101 days                  2
3   X1  Overflow 352 days                  3
4   X1  Overflow 101 days                  4
5   X1  Overflow  53 days                  5

The first one, as @piRSquared mentioned, is to use loc with the indexes of the returning rows:

df.loc[df.groupby(['Name', 'Property'])['Date'].idxmin()]

A second one would be to merge the original DataFrame and the aggregated one:

df.merge(df.groupby(['Name', 'Property'])['Date'].min().reset_index())

Both yield:

  Name  Property     Date  additional column
0   X1      Cash 101 days                  2
1   X1  Overflow  53 days                  5

Based on the use case, one may outperform the other.

Remove rows based on groupby of multiple columns resulting in lowest value only

Answers (2)

Related Questions