Reputation: 89
Here's a sample of my dataset:
side | serial_number | inspector | date_1 | date_2
top | 10 | Paul | 4/1/18 13:21 | 4/1/18 14:22
bot | 10 | Jack | 4/1/18 13:01 | 4/1/18 14:22
bot | 11 | Jack | 4/1/18 14:01 | 4/1/18 14:53
top | 11 | Paul | 4/1/18 14:25 | 4/1/18 14:53
top | 12 | Henry | 4/1/18 14:25 | 4/1/18 14:58
For each unique tuple (serial_number, date_2), I want to keep the row where date_1 is the minimum while keeping every column, so that eventually my dataset looks like this:
side | serial_number | inspector | date_1 | date_2
bot | 10 | Jack | 4/1/18 13:01 | 4/1/18 14:22
bot | 11 | Jack | 4/1/18 14:01 | 4/1/18 14:53
top | 12 | Henry | 4/1/18 14:25 | 4/1/18 14:58
To do so, my current code looks like this:
import pandas as pd
df = pd.read_csv("data.csv")  # read the data into a pandas dataframe
df_sorted = df.groupby(['serial_number', 'date_2'], sort=False)['date_1'].min()
df_sorted.to_csv("data_sorted.csv")
So in the end, I get the right rows, but the columns I'm not grouping by are missing. Here's the resulting dataset:
serial_number | date_1 | date_2
10 | 4/1/18 13:01 | 4/1/18 14:22
11 | 4/1/18 14:01 | 4/1/18 14:53
12 | 4/1/18 14:25 | 4/1/18 14:58
How do I keep all the columns? Thank you.
Upvotes: 4
Views: 4900
Reputation: 5215
Instead of calling min after your groupby, which returns the minimum value for each group, use idxmin, which returns the index value where the minimum occurs in each group:
df.groupby(['serial_number','date_2'])['date_1'].idxmin()
# serial_number date_2
# 10 2018-04-01 14:22:00 1
# 11 2018-04-01 14:53:00 2
# 12 2018-04-01 14:58:00 4
You can then use these indices with iloc
to select the complete rows in your dataframe where the minimum for each group occurs:
df.iloc[df.groupby(['serial_number','date_2'])['date_1'].idxmin()]
# side serial_number inspector date_1 \
# 1 bot 10 Jack 2018-04-01 13:01:00
# 2 bot 11 Jack 2018-04-01 14:01:00
# 4 top 12 Henry 2018-04-01 14:25:00
#
# date_2
# 1 2018-04-01 14:22:00
# 2 2018-04-01 14:53:00
# 4 2018-04-01 14:58:00
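Note that idxmin() returns index labels rather than positions, so df.loc is the safer selector if your dataframe does not have the default RangeIndex (with a default 0, 1, 2, … index, loc and iloc happen to coincide). Here is a minimal end-to-end sketch that also writes the result back to CSV as in the question; the data_min.csv output name and the df_min variable are made up for illustration:

import pandas as pd

# parse the date columns so the minimum is taken on real timestamps, not strings
df = pd.read_csv("data.csv", parse_dates=['date_1', 'date_2'])

# index label of the earliest date_1 within each (serial_number, date_2) group
idx = df.groupby(['serial_number', 'date_2'])['date_1'].idxmin()

# select those full rows by label, keeping every column
df_min = df.loc[idx]
df_min.to_csv("data_min.csv", index=False)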
Upvotes: 9
Reputation: 139
I think what you want can be achieved in 2 steps:

1. Sort the dataframe by date_1.
2. Call drop_duplicates with respect to the unique tuples (serial_number, date_2).

The following will be one solution:
df = pd.read_csv("data.csv")
df_sorted = df.sort_values(['date_1'], ascending=True)
df_result = df_sorted.drop_duplicates(subset=['serial_number', 'date_2'], keep='first')
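One caveat: if date_1 is read in as plain strings, sort_values orders the rows lexicographically rather than chronologically, so parsing the dates first is safer. A minimal sketch under that assumption (the data_min.csv output name is made up for illustration):

import pandas as pd

# parse the date columns so sorting by date_1 is chronological
df = pd.read_csv("data.csv", parse_dates=['date_1', 'date_2'])

# earliest date_1 first, then keep the first row of each (serial_number, date_2) group
df_min = (df.sort_values('date_1')
            .drop_duplicates(subset=['serial_number', 'date_2'], keep='first'))

df_min.to_csv("data_min.csv", index=False)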
Cheers!
Upvotes: 2