Sara
Sara

Reputation: 973

groupby and selecting in pandas

I'm trying to remove some rows in my table based on the change column.

id  subindex    change
A1      1           0
A1      2           1
A1      3           12
A1      4           0
A1      5           15
A1      6           1
A2      1           0
A2      2           11
A2      3           1
A2      4           0   

In a subgroup (Id is identical) When change is higher than 10, I want to remove the subsequent rows. So the table would be:

 id subindex    change
A1      1           0
A1      2           1
A2      1           0

What I did is a groupby to identify where the sub-tables should be cut:

df['cut_position']=df[df.change >= 10].groupby('id')['subindex'].transform(lambda x: x.min())

that gives:

id  subindex    change      cut_position
A1      1           0           0
A1      2           1           0
A1      3           12          3
A1      4           0           0
A1      5           15          0
A1      6           1           0   
A2      1           0           0
A2      2           11          2
A2      3           1           0
A2      4           0           0

Then a groupby('id') should be easy to do, but I get problems trying to obtain that table:

id  subindex    change      cut_position
A1      1           0           0
A1      2           1           0
A1      3           12          ToRemove_3
A1      4           0           ToRemove_0
A1      5           15          ToRemove_0
A1      6           1           ToRemove_0  
A2      1           0           0
A2      2           11          ToRemove_2
A2      3           1           ToRemove_0
A2      4           0           ToRemove_0

When writing the code that could generate the table, I get :

df.groupby('id')[df['subindex'] >= df['cut_position'].max()].map(lambda x : 'ToRemove_' + x)

KeyError: 'Columns not found: False, True'

Note: I did something that worked, but it took way too much time to do it...

Upvotes: 0

Views: 5931

Answers (1)

Bob Haffner
Bob Haffner

Reputation: 8503

Not 100% satisfied with this approach, but you could try it.

Given your frame

id  subindex    change
A1      1           0
A1      2           1
A1      3           12
A1      4           0
A1      5           15
A1      6           1
A2      1           0
A2      2           11
A2      3           1
A2      4           0 

Then create a frame with the first row of each 'id' thats greater than 10

rowindex = df['change'] > 10
greaterThan10 = df[rowindex].groupby('id',as_index=False).first()

Than a merge() of your original frame and greaterThan10

dfKeep = pd.merge(df,greaterThan10,how='left',on=['id'],suffixes=['','_cut'])

Then filtering the new frame for the 'keepers'

mask = (dfKeep['subindex'] < dfKeep['subindex_cut']) 
dfKeep[mask][['id','subindex','change']]

gives you

    id  subindex    change
0   A1  1           0
1   A1  2           1
6   A2  1           0

Upvotes: 2

Related Questions