Greg
Greg

Reputation: 101

DataFrame filter based on groupby

Here is my simplified example df:

      salesPerson customer measure timeStamp
      --------------------------------------
      A           123      I       12:30
      A           123      II      12:30
      A           123      III     12:30
      B           123      IV      12:35
      C           456      I       14:30
      C           456      II      14:30
      D           456      III     14:15

What I want to do, it to filer the dataframe and in cases when 2 different salesPerson Id's have the same customer number, keep all the rows of the salesPerson whos timeStamp is the earliest. Resulting df in this example would be:

      salesPerson customer measure timeStamp
      --------------------------------------
      A           123      I       12:30
      A           123      II      12:30
      A           123      III     12:30
      D           456      III     14:15

What would be the best/most pythonic way to do it? I thought about using pandas groupby.filter or groupby.transform, but frankly have no idea how to accurately write those.

Bonus points would be for having the deleted rows in a separate deleted_df object.

Upvotes: 0

Views: 323

Answers (1)

cmaher
cmaher

Reputation: 5225

This one-liner should do the trick:

df[df['salesPerson'].isin(df.iloc[df.groupby(['customer'])['timeStamp'].idxmin(), 'salesPerson'])]

Explanation:

To determine the salespersons to whom we want to filter, first group df by customer and get the index where the minimum timeStamp is found using idxmin:

df.groupby(['customer'])['timeStamp'].idxmin()

Then, pass those index values to iloc, along with the column we want, to get the values from salesPerson we'll use for filtering:

df.iloc[df.groupby(['customer'])['timeStamp'].idxmin(), 'salesPerson']

Finally, pass that result to the Series method isin, and use that to index into df. The result is thus:

0  A  123    I 2017-07-12 12:30:00
1  A  123   II 2017-07-12 12:30:00
2  A  123  III 2017-07-12 12:30:00
6  D  456  III 2017-07-12 14:15:00

To create a second DataFrame with the filtered-out rows, you could pass the index from the filtered df to the original df and exclude those rows. So if we assigned the result above to df1, we could create a complementary df2 in this manner:

df2 = df[~df.index.isin(df1.index)]

Result:

3  B  123  IV 2017-07-12 12:35:00
4  C  456   I 2017-07-12 14:30:00
5  C  456  II 2017-07-12 14:30:00

Upvotes: 2

Related Questions