Slicing a Pandas df based on column values

Question

How would you do the following?

For customers with churn=1, take the average of their last 3 purchases based on the month they leave. E.g. Churn_Month=3, then average last 3 purchases: from Mar, Feb and Jan if available. Sometimes would be 2 or 1 purchase.
For customers with churn=0, take the average of their last 3 purchases when available, sometimes would be 2 or 1 purchase.

And put it all in one pandas Dataframe. See Expected Output

Available information

df1: Here you'll find transactions with customer id, date, purchase1 and purchase2.

    ID  DATE        P1  P2
0   1   2003-04-01  449 55
1   4   2003-02-01  406 213
2   3   2003-11-01  332 372
3   1   2003-03-01  61  336
4   3   2003-10-01  428 247
5   3   2003-12-01  335 339
6   3   2003-09-01  367 41
7   2   2003-01-01  11  270
8   1   2003-01-01  55  102
9   2   2003-02-01  244 500
10  1   2003-02-01  456 272
11  5   2003-03-01  240 180
12  4   2002-12-01  156 152
13  5   2003-01-01  144 185
14  4   2003-01-01  246 428
15  1   2003-05-01  492 97
16  5   2003-02-01  371 66
17  5   2003-04-01  246 428
18  5   2003-05-01  406 213

df2: Here you'll find customer ID, whether they leave the company or not and the month when they left (E.g. 3.0 = March)

    ID  Churn   Churn_Month
0   1       1   3.0
1   2       0   0.0
2   3       1   12.0
3   4       0   0.0
4   5       1   4.0

Expected Output:

Mean of P1 and P2 by ID, merged with df2 information. ID will be the new index.

ID  P1      P2      Churn   Churn_Month
1   190.6   236.6    1         3.0
2   127.5   385      0         0.0
3   365     319.3    1         12.0
4   269.3   264.3    0         0.0
5   285.6   224.6    1         4.0

Ian Thompson · Accepted Answer

Some extra details were necessary here. First, when Churn == 1 assume that the customer left. Using df2 you can determine which month they left and remove any data that occurred after. From there it's pretty straight forward in terms of grouping, aggregating, and filter the data.

# merge
df3 = df1.merge(df2)

# convert DATE to datetime
df3.DATE = pd.to_datetime(df3.DATE)

# filter rows where month of (DATE is <= Churn_Month and Churn == 1)
# or Churn == 0
df3 = df3.loc[
    ((df3.Churn == 1) & (df3.DATE.dt.month <= df3.Churn_Month)) |
    (df3.Churn == 0)
].copy()

# sort values ascending
df3.sort_values([
    'ID',
    'DATE',
    'P1',
    'P2',
    'Churn',
    'Churn_Month'
], inplace=True)

# groupby ID, Churn
# take last 3 DATEs
# merge with original to filter rows
# group on ID, Churn, and Churn_Month
# average P1 and P2
# reset_index to get columns back
# round results to 1 decimal at the end
df3.groupby([
    'ID',
    'Churn'
]).DATE.nth([
    -1, -2, -3
]).reset_index().merge(df3).groupby([
    'ID',
    'Churn',
    'Churn_Month'
])[[
    'P1',
    'P2'
]].mean().reset_index().round(1)

Results

   ID  Churn  Churn_Month     P1     P2
0   1      1          3.0  190.7  236.7
1   2      0          0.0  127.5  385.0
2   3      1         12.0  365.0  319.3
3   4      0          0.0  269.3  264.3
4   5      1          4.0  285.7  224.7

Slicing a Pandas df based on column values

Answers (1)

Related Questions