Pepe
Pepe

Reputation: 484

Slicing a Pandas df based on column values

How would you do the following?

And put it all in one pandas Dataframe. See Expected Output

Available information

df1: Here you'll find transactions with customer id, date, purchase1 and purchase2.

    ID  DATE        P1  P2
0   1   2003-04-01  449 55
1   4   2003-02-01  406 213
2   3   2003-11-01  332 372
3   1   2003-03-01  61  336
4   3   2003-10-01  428 247
5   3   2003-12-01  335 339
6   3   2003-09-01  367 41
7   2   2003-01-01  11  270
8   1   2003-01-01  55  102
9   2   2003-02-01  244 500
10  1   2003-02-01  456 272
11  5   2003-03-01  240 180
12  4   2002-12-01  156 152
13  5   2003-01-01  144 185
14  4   2003-01-01  246 428
15  1   2003-05-01  492 97
16  5   2003-02-01  371 66
17  5   2003-04-01  246 428
18  5   2003-05-01  406 213

df2: Here you'll find customer ID, whether they leave the company or not and the month when they left (E.g. 3.0 = March)

    ID  Churn   Churn_Month
0   1       1   3.0
1   2       0   0.0
2   3       1   12.0
3   4       0   0.0
4   5       1   4.0

Expected Output:

Mean of P1 and P2 by ID, merged with df2 information. ID will be the new index.

ID  P1      P2      Churn   Churn_Month
1   190.6   236.6    1         3.0
2   127.5   385      0         0.0
3   365     319.3    1         12.0
4   269.3   264.3    0         0.0
5   285.6   224.6    1         4.0

Upvotes: 0

Views: 154

Answers (1)

Ian Thompson
Ian Thompson

Reputation: 3285

Some extra details were necessary here. First, when Churn == 1 assume that the customer left. Using df2 you can determine which month they left and remove any data that occurred after. From there it's pretty straight forward in terms of grouping, aggregating, and filter the data.

# merge
df3 = df1.merge(df2)

# convert DATE to datetime
df3.DATE = pd.to_datetime(df3.DATE)

# filter rows where month of (DATE is <= Churn_Month and Churn == 1)
# or Churn == 0
df3 = df3.loc[
    ((df3.Churn == 1) & (df3.DATE.dt.month <= df3.Churn_Month)) |
    (df3.Churn == 0)
].copy()

# sort values ascending
df3.sort_values([
    'ID',
    'DATE',
    'P1',
    'P2',
    'Churn',
    'Churn_Month'
], inplace=True)

# groupby ID, Churn
# take last 3 DATEs
# merge with original to filter rows
# group on ID, Churn, and Churn_Month
# average P1 and P2
# reset_index to get columns back
# round results to 1 decimal at the end
df3.groupby([
    'ID',
    'Churn'
]).DATE.nth([
    -1, -2, -3
]).reset_index().merge(df3).groupby([
    'ID',
    'Churn',
    'Churn_Month'
])[[
    'P1',
    'P2'
]].mean().reset_index().round(1)

Results

   ID  Churn  Churn_Month     P1     P2
0   1      1          3.0  190.7  236.7
1   2      0          0.0  127.5  385.0
2   3      1         12.0  365.0  319.3
3   4      0          0.0  269.3  264.3
4   5      1          4.0  285.7  224.7

Upvotes: 1

Related Questions