rods_91
rods_91

Reputation: 13

Filter dataframe based on duplicate entries conditions

I have a df with entries that are repeated due to revisions. I am trying to filter them out by their "status". For example:

>>> import pandas as pd
>>> d = pd.DataFrame( { 'Item':['Submission#1','Submission#1','Submission#2','Su
bmission#2','Submission#2','Submission#3','Submission#3'], 'Revision' : [0,1,0,1
,2,0,1], 'Status' : ['R','A','R','P','A','R','P']} )
>>> d
           Item  Revision Status
0  Submission#1         0      R
1  Submission#1         1      A
2  Submission#2         0      R
3  Submission#2         1      P
4  Submission#2         2      A
5  Submission#3         0      R
6  Submission#3         1      P

I want to keep the "Submissions" which are still "P" (pending) in their latest revision. "Submission#3" entries will only be recorded as "Submission#1" has been approved on the latest revision and "Submission#2" tho pending in "Revision" 1, was approved on "Revision" 2.

Result:

           Item  Revision Status
5  Submission#3         0      R
6  Submission#3         1      P

Could anyone guide me on how to filter data by looking for other instances of their entries, finding all "Revisions" made and selecting them based on the latest status?

Upvotes: 1

Views: 99

Answers (1)

Bharath M Shetty
Bharath M Shetty

Reputation: 30605

We can use groupby and isin based on boolean indexing i.e

new = d.loc[d.groupby('Item')['Revision'].idxmax()]

#         Item  Revision Status
#1  Submission#1         1      A
#4  Submission#2         2      A
#6  Submission#3         1      P

sub = new[new['Status'] == 'P']['Item']

#6    Submission#3
#Name: Item, dtype: object

late = d[d['Item'].isin(sub)]

Output late :

          Item  Revision Status
5  Submission#3         0      R
6  Submission#3         1      P

Upvotes: 1

Related Questions