Reputation: 551
I've got a sorted (by "customer") DataFrame which looks like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({"customer": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3],
"revenue": [np.nan, np.nan, 4.23, np.nan, 5.1, 523, np.nan, 4.2, np.nan, 23,
np.nan, np.nan, np.nan, np.nan]})
My goal is to get rid of the last row(s) of a customer if it's not a row with a positive revenue. So, if for a specific customer no row with a positive revenue is following then the row has to be deleted. In our example this would be the last row of customer 1 and the last two rows of customer 2.
In addition, I'd like to get rid of all customers which didn't buy at all. In our example, this would be customer 3. In the following picture I highlighted the rows which shall be deleted:
At the moment I'm using the following syntax to accomplish this task:
def deleting(sub_df):
# Delete customers that didn't buy
if sub_df["revenue"].sum() == 0:
return None
# Delte all rows after the last purchase
sub_df.loc[:, "dropping"] = sub_df.loc[:, "revenue"].fillna(method="bfill")
sub_df = sub_df.loc[~sub_df["dropping"].isnull(), :]
sub_df = sub_df.drop("dropping", axis="columns")
return sub_df
df.groupby("customer").apply(deleting).reset_index(drop=True)
I feel that there has to be a better solution since it feels weired to use a groupby to delete rows of a dataframe. Also, I need a faster solution since my dataframe has 1.8 million rows. So, performance is an issue.
Thanks for your help!
Upvotes: 1
Views: 421
Reputation: 150785
You can use bfill
on groupby
:
df[df.groupby('customer').revenue.bfill().notnull()]
Output:
0 1 NaN
1 1 NaN
2 1 4.23
3 1 NaN
4 1 5.10
5 1 523.00
7 2 4.20
8 2 NaN
9 2 23.00
Upvotes: 1