the_economist
the_economist

Reputation: 551

Dropping rows of DataFrame: Faster alternative for groupby.apply

I've got a sorted (by "customer") DataFrame which looks like this:

import pandas as pd
import numpy as np

df = pd.DataFrame({"customer": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3], 
                   "revenue": [np.nan, np.nan, 4.23, np.nan, 5.1, 523, np.nan, 4.2, np.nan, 23, 
                               np.nan, np.nan, np.nan, np.nan]})

My goal is to get rid of the last row(s) of a customer if it's not a row with a positive revenue. So, if for a specific customer no row with a positive revenue is following then the row has to be deleted. In our example this would be the last row of customer 1 and the last two rows of customer 2.

In addition, I'd like to get rid of all customers which didn't buy at all. In our example, this would be customer 3. In the following picture I highlighted the rows which shall be deleted:

enter image description here

At the moment I'm using the following syntax to accomplish this task:

def deleting(sub_df):
    # Delete customers that didn't buy
    if sub_df["revenue"].sum() == 0:
        return None

    # Delte all rows after the last purchase
    sub_df.loc[:, "dropping"] = sub_df.loc[:, "revenue"].fillna(method="bfill")
    sub_df = sub_df.loc[~sub_df["dropping"].isnull(), :]
    sub_df = sub_df.drop("dropping", axis="columns")
    return sub_df

df.groupby("customer").apply(deleting).reset_index(drop=True)

I feel that there has to be a better solution since it feels weired to use a groupby to delete rows of a dataframe. Also, I need a faster solution since my dataframe has 1.8 million rows. So, performance is an issue.

Thanks for your help!

Upvotes: 1

Views: 421

Answers (1)

Quang Hoang
Quang Hoang

Reputation: 150785

You can use bfill on groupby:

df[df.groupby('customer').revenue.bfill().notnull()]

Output:

0         1      NaN
1         1      NaN
2         1     4.23
3         1      NaN
4         1     5.10
5         1   523.00
7         2     4.20
8         2      NaN
9         2    23.00

Upvotes: 1

Related Questions