Reputation: 101
Here is my simplified example df:
salesPerson customer measure timeStamp
--------------------------------------
A 123 I 12:30
A 123 II 12:30
A 123 III 12:30
B 123 IV 12:35
C 456 I 14:30
C 456 II 14:30
D 456 III 14:15
What I want to do, it to filer the dataframe and in cases when 2 different salesPerson Id's have the same customer number, keep all the rows of the salesPerson whos timeStamp is the earliest. Resulting df in this example would be:
salesPerson customer measure timeStamp
--------------------------------------
A 123 I 12:30
A 123 II 12:30
A 123 III 12:30
D 456 III 14:15
What would be the best/most pythonic way to do it? I thought about using pandas groupby.filter or groupby.transform, but frankly have no idea how to accurately write those.
Bonus points would be for having the deleted rows in a separate deleted_df object.
Upvotes: 0
Views: 323
Reputation: 5225
This one-liner should do the trick:
df[df['salesPerson'].isin(df.iloc[df.groupby(['customer'])['timeStamp'].idxmin(), 'salesPerson'])]
Explanation:
To determine the salespersons to whom we want to filter, first group df
by customer
and get the index where the minimum timeStamp
is found using idxmin
:
df.groupby(['customer'])['timeStamp'].idxmin()
Then, pass those index values to iloc
, along with the column we want, to get the values from salesPerson
we'll use for filtering:
df.iloc[df.groupby(['customer'])['timeStamp'].idxmin(), 'salesPerson']
Finally, pass that result to the Series method isin
, and use that to index into df
. The result is thus:
0 A 123 I 2017-07-12 12:30:00
1 A 123 II 2017-07-12 12:30:00
2 A 123 III 2017-07-12 12:30:00
6 D 456 III 2017-07-12 14:15:00
To create a second DataFrame with the filtered-out rows, you could pass the index from the filtered df to the original df and exclude those rows. So if we assigned the result above to df1
, we could create a complementary df2
in this manner:
df2 = df[~df.index.isin(df1.index)]
Result:
3 B 123 IV 2017-07-12 12:35:00
4 C 456 I 2017-07-12 14:30:00
5 C 456 II 2017-07-12 14:30:00
Upvotes: 2