Tessa
Tessa

Reputation: 53

Drop rows at beginning of group with a specific value in pandas groupby

I have a pandas (multi-index) dataframe as follows:

             date       Volume
Account ID                     
10001   2  02-03-2017   0
        3  02-03-2017   0
        3  09-03-2017   0
        3  16-03-2017   50
        3  21-03-2017   65
        3  28-03-2017   0
        3  04-04-2017   0
        3  11-04-2017   60
10002   5  02-03-2017   14.5
        6  09-03-2017   14.5
        3  09-03-2017   0
        3  16-03-2017   0
        3  21-03-2017   20
        3  28-03-2017   33
10003   8  21-03-2017   14.5
        9  28-03-2017   15.0

Now I want to delete all rows at the beginning of a series (dates of an account-product combination) with volume 0. So I want to keep the rows with volume 0 in case they are in the middle or at the end of a series.

So in the above example, I'd want the following output:

             date       Volume
Account ID                     
10001   3  16-03-2017   50
        3  21-03-2017   65
        3  28-03-2017   0
        3  04-04-2017   0
        3  11-04-2017   60
10002   5  02-03-2017   14.5
        6  09-03-2017   14.5
        3  21-03-2017   20
        3  28-03-2017   33
10003   8  21-03-2017   14.5
        9  28-03-2017   15.0

Currently, I've been removing complete series with a filter, e.g.

df = data.groupby(level = acc_prod).filter(lambda x: len(x) > 26)

And I've seen examples of removing only the first row; Python: Pandas - Delete the first row by group . Yet I do not know how to only delete rows of zero at the beginning of an account-product series.

Would be great if someone could help me out on this!

Upvotes: 5

Views: 2142

Answers (1)

jezrael
jezrael

Reputation: 863166

You can use boolean indexing with mask created by groupby with cumsum and find values which are not 0:

print (df.groupby(level=['Account','ID'])['Volume'].cumsum())
Account  ID
10001    2       0.0
         3       0.0
         3       0.0
         3      50.0
         3     115.0
         3     115.0
         3     115.0
         3     175.0
10002    5      14.5
         6      14.5
         3       0.0
         3       0.0
         3      20.0
         3      53.0
10003    8      14.5
         9      15.0
Name: Volume, dtype: float64
mask = df.groupby(level=['Account','ID'])['Volume'].cumsum() != 0
#!= is same as ne function
#mask = df.groupby(level=['Account','ID'])['Volume'].cumsum().ne(0)
print (mask)
Account  ID
10001    2     False
         3     False
         3     False
         3      True
         3      True
         3      True
         3      True
         3      True
10002    5      True
         6      True
         3     False
         3     False
         3      True
         3      True
10003    8      True
         9      True
Name: Volume, dtype: bool
print (df[mask])
                  date  Volume
Account ID                    
10001   3   16-03-2017    50.0
        3   21-03-2017    65.0
        3   28-03-2017     0.0
        3   04-04-2017     0.0
        3   11-04-2017    60.0
10002   5   02-03-2017    14.5
        6   09-03-2017    14.5
        3   21-03-2017    20.0
        3   28-03-2017    33.0
10003   8   21-03-2017    14.5
        9   28-03-2017    15.0

Upvotes: 4

Related Questions