Reputation: 65
I have a dataframe of customers. Each customer has several observations and variables, and some are no longer customers.
My question is the following:
I want to group by customer and select the last row if the customer is still a customer, and the second to last row if the customer is not a customer anymore. I have a column named churned that is 1 if the customer has churned (no longer a customer), and 0 if it is still a customer, for all observations of the customer.
I know I can select the last row (-1) or second to last row (-2) like this:
df = df.groupby(['CustomerID'],as_index=False).nth(-1).reset_index()
This also seems to work, although the order is different:
df = df.groupby(['CustomerID']).apply(lambda x: x.iloc[-1])
I have been trying to use an if/else expression in a lambda function like this:
df = df.groupby(['CustomerID'],as_index=False).apply(lambda x: x.iloc[-2] if x['churned']==1 else x.iloc[-1]).reset_index()
But I get ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Is there any way I can group by CustomerID, condition on the column churned and get only the last row of the customers where churned==0, and the second to last row if churned==1?
Upvotes: 4
Views: 2761
Reputation: 5335
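The comparison x['churned']==1 is evaluated for every row in the group, so it returns a boolean Series rather than a single True or False, and an if cannot decide on that. To check whether the value is present anywhere in the group, use any():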
df = (
    df.groupby(['CustomerID'], as_index=False)
      .apply(lambda x: x.iloc[-2] if (x['churned'] == 1).any() else x.iloc[-1])
      .reset_index()
)
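To make the any() condition easier to check, here is a minimal sketch on made-up data. The visit column, the sample values, and the helper name pick_row are assumptions for illustration and are not from the question. It loops over the groups instead of calling apply, which is a substitution on my part that keeps the result shape easy to predict.

```python
import pandas as pd

# Hypothetical sample data: churned is constant within each customer.
df = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3, 3, 3],
    "visit":      [1, 2, 3, 1, 2, 1, 2, 3],
    "churned":    [0, 0, 0, 1, 1, 1, 1, 1],
})

def pick_row(group):
    # (group["churned"] == 1) is a boolean Series; any() reduces it to one bool.
    position = -2 if (group["churned"] == 1).any() else -1
    return group.iloc[position]

# Pick one row per customer and stack the picked rows into a new dataframe.
rows = [pick_row(group) for _, group in df.groupby("CustomerID")]
result = pd.DataFrame(rows).reset_index(drop=True)
print(result)
```

Note that iloc[-2] raises an IndexError for a churned customer with only one row, so such groups would need their own handling.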
Upvotes: 1
Reputation: 6574
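You need to split the data into two dataframes, one for current customers and one for churned customers, and use the last() function on the current customers. For the churned customers, take the second to last row instead, for example with nth(-2) as in the question: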
df.groupby(['CustomerID'],as_index=False).last()
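The split idea can be sketched end to end as follows. The sample data and the visit column are made up for illustration. I use tail(1) and cumcount here rather than last(), which is my own substitution: last() returns the last non-null value of each column, so with missing values it can mix values from different rows.

```python
import pandas as pd

# Hypothetical sample data: churned is constant within each customer.
df = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3, 3, 3],
    "visit":      [1, 2, 3, 1, 2, 1, 2, 3],
    "churned":    [0, 0, 0, 1, 1, 1, 1, 1],
})

# Split into current and churned customers.
active = df[df["churned"] == 0]
churned = df[df["churned"] == 1]

# Last row of each current customer.
last_active = active.groupby("CustomerID").tail(1)

# Number rows from the end of each group: 0 = last, 1 = second to last.
from_end = churned.groupby("CustomerID").cumcount(ascending=False)
second_last_churned = churned[from_end == 1]

# Recombine and restore a per-customer ordering.
result = (
    pd.concat([last_active, second_last_churned])
      .sort_values("CustomerID")
      .reset_index(drop=True)
)
print(result)
```

In this sketch a churned customer with only one row has no second to last row and is simply left out of the result.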
Upvotes: 3