Wisdom258
Wisdom258

Reputation: 173

Create a new Dataframe based on Time Difference and a condition on columns in pandas dataframe

I have a dataframe, where I have a ticket IDs and subsequent actions taken with the date time column a shown in the table below.

ticketID    ChangeDate  OldStatus   NewStatus
0   1012327 2019-03-18 09:00:32.903 R or O  Action mail sent to client
1   1012327 2019-03-18 09:21:34.820 Action mail sent to client  Response Client - R
2   1012327 2019-03-18 09:34:21.890 Response Client - R Status Updated
3   1012328 2019-03-18 07:00:09.960 R or O  ticket Closed - None
4   1012328 2019-03-18 07:09:31.420 ticket Closed - None    Status Updated
5   1012329 2019-03-18 06:52:03.490 R or O  ticket Closed - Satisfied
6   1012329 2019-03-18 07:09:33.433 ticket Closed - Satisfied   Status Updated
7   1012330 2019-03-18 10:25:13.493 R or O  Action mail sent to Service
8   1012330 2019-03-18 10:55:20.963 Action mail sent to Service ticket Closed - Service Responded
9   1012330 2019-03-18 11:02:05.327 ticket Closed - Service Responded   Status Updated
10  1012332 2019-03-18 09:00:41.967 R or O  Action mail sent to client
11  1012332 2019-03-18 10:24:20.150 Action mail sent to client  Response Client - R
12  1012332 2019-03-18 10:32:40.717 Response Client - R Status Updated

Dataframe in Pandas Table

Now, I have certain ticket IDs and some of them are having more observations based on the statuses provided. You can also see that, for a ticket id, for next observation, new status becomes old status and a new status is provided for the same, which keeps updated, until some closing action i taken.

I want to create a new dataframe/series which has the following format.

ticket ID  | Datetime1  | Oldest Status  | New Status | Datetime2  | New Status2| Datetime3  | New Status3 ....

so that I have the oldest status and new statuses with dates as shown above till we do this for each of the ticket ID.

My plan is to use this dataset, to calculate the time differences later on.

Upvotes: 1

Views: 93

Answers (2)

Serge Ballesta
Serge Ballesta

Reputation: 148965

I would first use a groupby on ticketID to compute a rank per ticket, then pivot the dataframe using that rank as column and ticketID as index to obtain the expected data.

After sorting the columns you get the expected dataframe. Time to rename the columns and reset the index to have a nice dataframe. Code could be:

df['rank'] = df.groupby('ticketID').apply(lambda x:
                                          pd.Series(range(len(x)))).values
resul = df.pivot('ticketID', 'rank').fillna('')
resul.columns = resul.columns.swaplevel()
resul.sort_index(axis=1,inplace=True, level=0, sort_remaining=False)
resul.columns = ['{1}_{0}'.format(*c) for c in resul.columns]
resul.reset_index(inplace=True)

With your example data it gives:

   ticketID             ChangeDate_0 OldStatus_0                  NewStatus_0             ChangeDate_1                  OldStatus_1                        NewStatus_1             ChangeDate_2                        OldStatus_2     NewStatus_2
0   1012327  2019-03-18 09:00:32.903      R or O   Action mail sent to client  2019-03-18 09:21:34.820   Action mail sent to client                Response Client - R  2019-03-18 09:34:21.890                Response Client - R  Status Updated
1   1012328  2019-03-18 07:00:09.960      R or O         ticket Closed - None  2019-03-18 07:09:31.420         ticket Closed - None                     Status Updated                                                                            
2   1012329  2019-03-18 06:52:03.490      R or O    ticket Closed - Satisfied  2019-03-18 07:09:33.433    ticket Closed - Satisfied                     Status Updated                                                                            
3   1012330  2019-03-18 10:25:13.493      R or O  Action mail sent to Service  2019-03-18 10:55:20.963  Action mail sent to Service  ticket Closed - Service Responded  2019-03-18 11:02:05.327  ticket Closed - Service Responded  Status Updated
4   1012332  2019-03-18 09:00:41.967      R or O   Action mail sent to client  2019-03-18 10:24:20.150   Action mail sent to client                Response Client - R  2019-03-18 10:32:40.717                Response Client - R  Status Updated

Upvotes: 2

Erfan
Erfan

Reputation: 42916

We can groupby on ticketID and aggregate to a list for each column. Then we split these lists to columns and give them a prefix in this case to be able to sort them correctly:

def aggregate_data(dataframe, groupcol, col):
    grp = dataframe.groupby(groupcol).agg(list)

    temp = pd.DataFrame(grp[col].to_numpy().tolist(), index=grp.index)

    temp.columns = [f'{c}_{col}' for c in temp.columns]

    return temp

new = pd.concat([aggregate_data(df, 'ticketID', column) for column in df.iloc[:, 1:].columns], 
                axis=1).sort_index(axis=1).reset_index()
   ticketID             0_ChangeDate                  0_NewStatus 0_OldStatus             1_ChangeDate                        1_NewStatus                  1_OldStatus             2_ChangeDate
0   1012327  2019-03-18 09:00:32.903   Action mail sent to client      R or O  2019-03-18 09:21:34.820                Response Client - R   Action mail sent to client  2019-03-18 09:34:21.890
1   1012328  2019-03-18 07:00:09.960         ticket Closed - None      R or O  2019-03-18 07:09:31.420                     Status Updated         ticket Closed - None                     None
2   1012329  2019-03-18 06:52:03.490    ticket Closed - Satisfied      R or O  2019-03-18 07:09:33.433                     Status Updated    ticket Closed - Satisfied                     None
3   1012330  2019-03-18 10:25:13.493  Action mail sent to Service      R or O  2019-03-18 10:55:20.963  ticket Closed - Service Responded  Action mail sent to Service  2019-03-18 11:02:05.327
4   1012332  2019-03-18 09:00:41.967   Action mail sent to client      R or O  2019-03-18 10:24:20.150                Response Client - R   Action mail sent to client  2019-03-18 10:32:40.717

Upvotes: 2

Related Questions