Reputation: 13
I have a df with entries that are repeated due to revisions. I am trying to filter them out by their "status". For example:
>>> import pandas as pd
>>> d = pd.DataFrame( { 'Item':['Submission#1','Submission#1','Submission#2','Su
bmission#2','Submission#2','Submission#3','Submission#3'], 'Revision' : [0,1,0,1
,2,0,1], 'Status' : ['R','A','R','P','A','R','P']} )
>>> d
Item Revision Status
0 Submission#1 0 R
1 Submission#1 1 A
2 Submission#2 0 R
3 Submission#2 1 P
4 Submission#2 2 A
5 Submission#3 0 R
6 Submission#3 1 P
I want to keep the "Submissions" which are still "P" (pending) in their latest revision. "Submission#3" entries will only be recorded as "Submission#1" has been approved on the latest revision and "Submission#2" tho pending in "Revision" 1, was approved on "Revision" 2.
Result:
Item Revision Status
5 Submission#3 0 R
6 Submission#3 1 P
Could anyone guide me on how to filter data by looking for other instances of their entries, finding all "Revisions" made and selecting them based on the latest status?
Upvotes: 1
Views: 99
Reputation: 30605
We can use groupby and isin based on boolean indexing i.e
new = d.loc[d.groupby('Item')['Revision'].idxmax()]
# Item Revision Status
#1 Submission#1 1 A
#4 Submission#2 2 A
#6 Submission#3 1 P
sub = new[new['Status'] == 'P']['Item']
#6 Submission#3
#Name: Item, dtype: object
late = d[d['Item'].isin(sub)]
Output late :
Item Revision Status 5 Submission#3 0 R 6 Submission#3 1 P
Upvotes: 1