Reputation: 89
Here's a sample of my dataset:
side | serial_number | inspector | date_1 | date_2
top | 10 | Paul | 4/1/18 13:21 | 4/1/18 14:22
bot | 10 | Jack | 4/1/18 13:01 | 4/1/18 14:22
bot | 11 | Jack | 4/1/18 14:01 | 4/1/18 14:53
top | 11 | Paul | 4/1/18 14:25 | 4/1/18 14:53
top | 12 | Henry | 4/1/18 14:25 | 4/1/18 14:58
For each unique tuple (serial_number, date_2), I want to keep the row where date_1 is the minimum while keeping every column, so that eventually my dataset looks like this:
side | serial_number | inspector | date_1 | date_2
bot | 10 | Jack | 4/1/18 13:01 | 4/1/18 14:22
bot | 11 | Jack | 4/1/18 14:01 | 4/1/18 14:53
top | 12 | Henry | 4/1/18 14:25 | 4/1/18 14:58
To do so, my current code looks like this:
import pandas as pd
df = pd.read_csv("data.csv")  # read the data into a pandas dataframe
df_sorted = df.groupby(['serial_number', 'date_2'], sort=False)['date_1'].min()
df_sorted.to_csv("data_sorted.csv")
So in the end, I get the right rows, but the columns I'm not grouping by are missing. Here's the resulting dataset:
serial_number | date_1 | date_2
10 | 4/1/18 13:01 | 4/1/18 14:22
11 | 4/1/18 14:01 | 4/1/18 14:53
12 | 4/1/18 14:25 | 4/1/18 14:58
How do I keep all the columns? Thank you.
Upvotes: 4
Views: 4900
Reputation: 5215
Instead of calling min after your groupby, which returns the minimum value for each group, use idxmin, which returns the index value where the minimum occurs in each group:
df.groupby(['serial_number','date_2'])['date_1'].idxmin()
# serial_number date_2
# 10 2018-04-01 14:22:00 1
# 11 2018-04-01 14:53:00 2
# 12 2018-04-01 14:58:00 4
You can then use these indices with iloc
to select the complete rows in your dataframe where the minimum for each group occurs:
df.iloc[df.groupby(['serial_number','date_2'])['date_1'].idxmin()]
# side serial_number inspector date_1 \
# 1 bot 10 Jack 2018-04-01 13:01:00
# 2 bot 11 Jack 2018-04-01 14:01:00
# 4 top 12 Henry 2018-04-01 14:25:00
#
# date_2
# 1 2018-04-01 14:22:00
# 2 2018-04-01 14:53:00
# 4 2018-04-01 14:58:00
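Note that idxmin() returns index labels rather than positions, so df.loc is the safer selector if your dataframe does not have the default RangeIndex (with a default 0, 1, 2, … index, loc and iloc happen to coincide). Here is a minimal end-to-end sketch that also writes the result back to CSV as in the question; the data_min.csv output name and the df_min variable are made up for illustration:

import pandas as pd

# parse the date columns so the minimum is taken on real timestamps, not strings
df = pd.read_csv("data.csv", parse_dates=['date_1', 'date_2'])

# index label of the earliest date_1 within each (serial_number, date_2) group
idx = df.groupby(['serial_number', 'date_2'])['date_1'].idxmin()

# select those full rows by label, keeping every column
df_min = df.loc[idx]
df_min.to_csv("data_min.csv", index=False)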
Upvotes: 9
Reputation: 139
I think what you want can be achieved in 2 steps:

1. Sort the dataframe by date_1.
2. Call drop_duplicates with respect to the unique tuples (serial_number, date_2).

The following will be one solution:
df = pd.read_csv("data.csv")
df_sorted = df.sort_values(['date_1'], ascending=True)
df_result = df_sorted.drop_duplicates(subset=['serial_number', 'date_2'], keep='first')
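One caveat: if date_1 is read in as plain strings, sort_values orders the rows lexicographically rather than chronologically, so parsing the dates first is safer. A minimal sketch under that assumption (the data_min.csv output name is made up for illustration):

import pandas as pd

# parse the date columns so sorting by date_1 is chronological
df = pd.read_csv("data.csv", parse_dates=['date_1', 'date_2'])

# earliest date_1 first, then keep the first row of each (serial_number, date_2) group
df_min = (df.sort_values('date_1')
            .drop_duplicates(subset=['serial_number', 'date_2'], keep='first'))

df_min.to_csv("data_min.csv", index=False)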
Cheers!
Upvotes: 2