king
king

Reputation: 11

self join in python data frame

I have 5 columns in a python data frame:

s.no, customer_id,status     issue_date, action_date, resolve_date
1234   600123     resolved    10-10-2021   11-11-2021   12-11-2021
1235   600123     resolved    15-10-2021   11-11-2021   12-11-2021
1236   600123     resolved    16-10-2021   11-11-2021   12-11-2021
1237   700673     in-progress 17-10-2021   29-10-2021   
1238   700673     in-progress 18-10-2021    
1239   800165     submitted   20-10-2021
1239   800165     submitted   21-10-2021    

there are four statuses(submitted, in-progress, resolved, canceled). S.no will be auto-increment and customer id is unique. The same customer can raise multiple issues. CS team cannot mark the issue as resolved until all the issues raised by the customer are resolved and it will show as in-progress for all the issues.

I want help to get a single row per customer with initial issue submitted date (issue_date), max action date, and max resolve date. For example, if 4 issues have been raised any issue is still left to resolve, then the issue_date should be the initial date of (1st issue date), the action date will be the latest date and the resolve date should be null

sample output should be grouped by customer_id

customer_id,status  ,issue_date, action_date, resolve_date
600123     resolved  10-10-2021   11-11-2021   12-11-2021
700673    in-progress 17-10-2021   29-10-2021   NULL
800165     submitted   21-10-2021  NULL         NULL

Upvotes: 1

Views: 92

Answers (1)

furas
furas

Reputation: 142983

First you can use groupby("customer_id") to work with every customer separatelly,

and next you can use agg() (aggregation) to run max(), min(), last(), first(), etc. for every column (in group) separatelly.

text = '''s.no customer_id status     issue_date action_date resolve_date
1234   600123     resolved    10-10-2021   11-11-2021   12-11-2021
1235   600123     resolved    15-10-2021   11-11-2021   12-11-2021
1236   600123     resolved    16-10-2021   11-11-2021   12-11-2021
1237   700673     in-progress 17-10-2021   29-10-2021   
1238   700673     in-progress 18-10-2021    
1239   800165     submitted   20-10-2021
1239   800165     submitted   21-10-2021    '''

import pandas as pd
import io

df = pd.read_csv(io.StringIO(text), sep="\s+")
df['issue_date'] = pd.to_datetime(df['issue_date'])
df['action_date'] = pd.to_datetime(df['action_date'])
df['resolve_date'] = pd.to_datetime(df['resolve_date'])

print(df)

df_new = df.groupby('customer_id').agg({'status':'last', 'issue_date': 'first', 'action_date': 'max', 'resolve_date': 'last'})

df_new = df_new.reset_index()

print(df_new)

Result:

   s.no  customer_id       status issue_date action_date resolve_date
0  1234       600123     resolved 2021-10-10  2021-11-11   2021-12-11
1  1235       600123     resolved 2021-10-15  2021-11-11   2021-12-11
2  1236       600123     resolved 2021-10-16  2021-11-11   2021-12-11
3  1237       700673  in-progress 2021-10-17  2021-10-29          NaT
4  1238       700673  in-progress 2021-10-18         NaT          NaT
5  1239       800165    submitted 2021-10-20         NaT          NaT
6  1239       800165    submitted 2021-10-21         NaT          NaT

   customer_id       status issue_date action_date resolve_date
0       600123     resolved 2021-10-10  2021-11-11   2021-12-11
1       700673  in-progress 2021-10-17  2021-10-29          NaT
2       800165    submitted 2021-10-20         NaT          NaT

Upvotes: 0

Related Questions