Jdoe
Jdoe

Reputation: 89

Pandas : how to keep rows in groupby where column value = min

here's a sample of my dataset

side     |  serial_number   |   inspector  |   date_1        |    date_2
top      |       10         |   Paul       |   4/1/18 13:21  |    4/1/18 14:22
bot      |       10         |   Jack       |   4/1/18 13:01  |    4/1/18 14:22
bot      |       11         |   Jack       |   4/1/18 14:01  |    4/1/18 14:53
top      |       11         |   Paul       |   4/1/18 14:25  |    4/1/18 14:53
top      |       12         |   Henry      |   4/1/18 14:25  |    4/1/18 14:58

For each unique tuple (serial_number, date_2), I want to keep the row where date_1 is minimum and keep every column, so that eventually my dataset looks like this :

side     |  serial_number   |   inspector  |   date_1        |    date_2
bot      |       10         |   Jack       |   4/1/18 13:01  |    4/1/18 14:22
bot      |       11         |   Jack       |   4/1/18 14:01  |    4/1/18 14:53
top      |       12         |   Henry      |   4/1/18 14:25  |    4/1/18 14:58

To do so, my current code looks like this :

import pandas as pd

df = pd.read_csv("data.csv") #getting the data in a pandas dataframe
df_sorted = df.groupby(['serial_number','date_2'], sort=False)['date_1'].min()
df_sorted .to_csv("data_sorted.csv")

So in the end, I got the right dataset but columns I'm not grouping by are missing. Here's the resulting dataset :

 serial_number   |     date_1        |    date_2
      10         |     4/1/18 13:01  |    4/1/18 14:22
      11         |     4/1/18 14:01  |    4/1/18 14:53
      12         |     4/1/18 14:25  |    4/1/18 14:58

How do I keep all columns ? Thank you.

Upvotes: 4

Views: 4900

Answers (2)

cmaher
cmaher

Reputation: 5215

Instead of calling min, after your groupby, which returns the minimum value for each group, instead use idxmin, which returns the index value where the minimum occurs in each group:

df.groupby(['serial_number','date_2'])['date_1'].idxmin()

# serial_number  date_2             
# 10             2018-04-01 14:22:00    1
# 11             2018-04-01 14:53:00    2
# 12             2018-04-01 14:58:00    4

You can then use these indices with iloc to select the complete rows in your dataframe where the minimum for each group occurs:

df.iloc[df.groupby(['serial_number','date_2'])['date_1'].idxmin()]

#   side        serial_number inspector                    date_1  \
# 1  bot                   10     Jack        2018-04-01 13:01:00   
# 2  bot                   11     Jack        2018-04-01 14:01:00   
# 4  top                   12     Henry       2018-04-01 14:25:00   
# 
#                date_2  
# 1 2018-04-01 14:22:00  
# 2 2018-04-01 14:53:00  
# 4 2018-04-01 14:58:00  

Upvotes: 9

Michael L. Tanny
Michael L. Tanny

Reputation: 139

I think what you want can be achieved in 2 steps

  1. Sorting the data with earliest-to-latest date_1
  2. Perform a drop_duplicates with respect to the unique tuples

The following will be one solution:

df = pd.read_csv("data.csv")
df_sorted = df.sort(['date_1'], ascending=True)
df_sorted.drop_duplicates(subset=['serial_number','date_2'], keep='first')

Cheers!

Upvotes: 2

Related Questions