Reputation: 11
I have 5 columns in a python data frame:
s.no, customer_id,status issue_date, action_date, resolve_date
1234 600123 resolved 10-10-2021 11-11-2021 12-11-2021
1235 600123 resolved 15-10-2021 11-11-2021 12-11-2021
1236 600123 resolved 16-10-2021 11-11-2021 12-11-2021
1237 700673 in-progress 17-10-2021 29-10-2021
1238 700673 in-progress 18-10-2021
1239 800165 submitted 20-10-2021
1239 800165 submitted 21-10-2021
there are four statuses(submitted, in-progress, resolved, canceled). S.no will be auto-increment and customer id is unique. The same customer can raise multiple issues. CS team cannot mark the issue as resolved until all the issues raised by the customer are resolved and it will show as in-progress for all the issues.
I want help to get a single row per customer with initial issue submitted date (issue_date), max action date, and max resolve date. For example, if 4 issues have been raised any issue is still left to resolve, then the issue_date should be the initial date of (1st issue date), the action date will be the latest date and the resolve date should be null
sample output should be grouped by customer_id
customer_id,status ,issue_date, action_date, resolve_date
600123 resolved 10-10-2021 11-11-2021 12-11-2021
700673 in-progress 17-10-2021 29-10-2021 NULL
800165 submitted 21-10-2021 NULL NULL
Upvotes: 1
Views: 92
Reputation: 142983
First you can use groupby("customer_id")
to work with every customer separatelly,
and next you can use agg()
(aggregation
) to run max()
, min()
, last()
, first()
, etc. for every column (in group) separatelly.
text = '''s.no customer_id status issue_date action_date resolve_date
1234 600123 resolved 10-10-2021 11-11-2021 12-11-2021
1235 600123 resolved 15-10-2021 11-11-2021 12-11-2021
1236 600123 resolved 16-10-2021 11-11-2021 12-11-2021
1237 700673 in-progress 17-10-2021 29-10-2021
1238 700673 in-progress 18-10-2021
1239 800165 submitted 20-10-2021
1239 800165 submitted 21-10-2021 '''
import pandas as pd
import io
df = pd.read_csv(io.StringIO(text), sep="\s+")
df['issue_date'] = pd.to_datetime(df['issue_date'])
df['action_date'] = pd.to_datetime(df['action_date'])
df['resolve_date'] = pd.to_datetime(df['resolve_date'])
print(df)
df_new = df.groupby('customer_id').agg({'status':'last', 'issue_date': 'first', 'action_date': 'max', 'resolve_date': 'last'})
df_new = df_new.reset_index()
print(df_new)
Result:
s.no customer_id status issue_date action_date resolve_date
0 1234 600123 resolved 2021-10-10 2021-11-11 2021-12-11
1 1235 600123 resolved 2021-10-15 2021-11-11 2021-12-11
2 1236 600123 resolved 2021-10-16 2021-11-11 2021-12-11
3 1237 700673 in-progress 2021-10-17 2021-10-29 NaT
4 1238 700673 in-progress 2021-10-18 NaT NaT
5 1239 800165 submitted 2021-10-20 NaT NaT
6 1239 800165 submitted 2021-10-21 NaT NaT
customer_id status issue_date action_date resolve_date
0 600123 resolved 2021-10-10 2021-11-11 2021-12-11
1 700673 in-progress 2021-10-17 2021-10-29 NaT
2 800165 submitted 2021-10-20 NaT NaT
Upvotes: 0