Preben

Reputation: 65

Pandas groupby select last row or second to last row based on value (0 or 1) in another column

I have a dataframe of customers. Each customer has several observations (rows) and variables (columns), and some customers have since churned.

My question is the following:

I want to group by customer and select the last row if the customer is still a customer, and the second to last row if the customer is not a customer anymore. I have a column named churned that is 1 if the customer has churned (no longer a customer), and 0 if it is still a customer, for all observations of the customer.

I know I can select the last row (-1) or second to last row (-2) as such:
df = df.groupby(['CustomerID'],as_index=False).nth(-1).reset_index()

This also seems to work, although the order is different:
df = df.groupby(['CustomerID']).apply(lambda x: x.iloc[-1])

I have been trying to use an if/else expression in a lambda function like this:
df = df.groupby(['CustomerID'],as_index=False).apply(lambda x: x.iloc[-2] if x['churned']==1 else x.iloc[-1]).reset_index()

But I get ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Is there any way I can group by CustomerID, condition on the column churned and get only the last row of the customers where churned==0, and the second to last row if churned==1?

Upvotes: 4

Views: 2761

Answers (2)

Michael

Reputation: 5335

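x['churned']==1 is evaluated for every row in the group, so it returns a boolean Series rather than a single True or False, and an if cannot decide on a whole Series. To check whether the value 1 is present anywhere in the group, reduce the Series with any():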

df = (
    df.groupby(['CustomerID'], as_index=False)
      .apply(lambda x: x.iloc[-2] if (x['churned'] == 1).any() else x.iloc[-1])
      .reset_index()
)
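To see the any() check in isolation, here is a minimal sketch on made-up data. The sample frame and the value column are assumptions for illustration; only CustomerID and churned come from the question. It iterates over the groups explicitly instead of calling apply, which keeps the result shape easy to reason about, and it assumes every churned customer has at least two rows (iloc[-2] raises an IndexError otherwise).

```python
import pandas as pd

# Hypothetical sample data; "value" is an invented column for illustration.
df = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 2, 3, 3],
    "value":      [10, 11, 12, 20, 21, 22, 30, 31],
    "churned":    [0, 0, 0, 1, 1, 1, 0, 0],
})

# (g["churned"] == 1) is a boolean Series; .any() collapses it to one bool,
# which is what the if/else expression needs.
rows = [
    g.iloc[-2] if (g["churned"] == 1).any() else g.iloc[-1]
    for _, g in df.groupby("CustomerID")
]

# Each picked row is a Series; stack them back into a DataFrame.
result = pd.DataFrame(rows).reset_index(drop=True)
print(result)
```

Because churned is constant within a customer, any() and all() would pick the same rows here; any() is used to match the answer above.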

Upvotes: 1

gtomer

Reputation: 6574

You need to split the data into two dataframes, one for current customers and one for churned customers, and use the last() function to get the last row per customer:

df.groupby(['CustomerID'],as_index=False).last()
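One possible way to spell out the split is sketched below. The sample frame and the value column are assumptions for illustration. The sketch uses tail() and head() rather than last(), because last() returns the last non-null value of each column, which is not necessarily one intact row when the data contains missing values. Note one assumption in the churned branch: a churned customer with a single row keeps that row.

```python
import pandas as pd

# Hypothetical sample data; "value" is an invented column for illustration.
df = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 2, 3, 3],
    "value":      [10, 11, 12, 20, 21, 22, 30, 31],
    "churned":    [0, 0, 0, 1, 1, 1, 0, 0],
})

# Split into current customers and churned customers.
active = df[df["churned"] == 0]
churned = df[df["churned"] == 1]

# Last row of every current customer.
last_active = active.groupby("CustomerID").tail(1)

# Second to last row of every churned customer:
# keep the final two rows per group, then the first of those two.
second_last_churned = (
    churned.groupby("CustomerID").tail(2)
           .groupby("CustomerID").head(1)
)

# Recombine and restore the original row order.
result = pd.concat([last_active, second_last_churned]).sort_index()
print(result)
```

tail() and head() keep the original index, so sort_index() puts the combined rows back in the order they had in df.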

Upvotes: 3
