RaviP
RaviP

Reputation: 36

Split / Slice column values (text data) into multiple rows

I have following dataframe

data = [
    ['ticket_1', '''2021-04-18 11:11:23 - Unknown
Part replaced

2021-04-18 09:03:10 - John Doe
Requires part change. Technician enroute

2021-04-17 19:30:25 - John Doe
Working on the issue'''], 
    ['ticket_2','''2021-04-17 19:13:09 - Unknown_2
Transferring to other resource

2021-04-17 19:12:34 - Unknown_2
Issue assigned''']
]

original dataframe

I would like to split the comments column and perform below steps

1. Extract date time and include in another column

2. Have string after date time in another column

3. Duplicate ticket_num based on multiple date time in comments column

For example

data_1 = [
    ['ticket_1','''2021-04-18 11:11:23''','''Unknown
Part replaced'''],
    ['ticket_1', '''2021-04-18 09:03:10''', '''John Doe
Requires part change. Technician enroute'''],
    ['ticket_1','''2021-04-17 19:30:25''', '''John Doe
Working on the issue'''],
    ['ticket_2','''2021-04-17 19:13:09''','''Unknown_2
Transferring to other resource'''],
    ['ticket_2','''2021-04-17 19:12:34''','''Unknown_2
Issue assigned''']
]

desired output

I have tried using combinations of regex, explode, split and slice but not getting fruitful results.

Upvotes: 0

Views: 95

Answers (1)

ThePyGuy
ThePyGuy

Reputation: 18406

Assuming the dataframe has following structure:

  ticket_num                                                                                                                                                                      comments
0   ticket_1  2021-04-18 11:11:23 - Unknown\nPart replaced\n2021-04-18 09:03:10 - John Doe\nRequires part change. Technician enroute\n2021-04-17 19:30:25 - John Doe\nWorking on the issue
1   ticket_2                                                              2021-04-17 19:13:09 - Unknown_2\nTransferring to other resource\n2021-04-17 19:12:34 - Unknown_2\nIssue assigned

Start by finding all the occurances of the string values in comments column on datetime value followed by any character until next datetime value or the end of the multiline string i.e. \Z, using Series.str.split, and passing the proper regex. You also need to pass re.DOTALL flag since it's a multiline string, then explode the dataframe on this new columns, and convert the value of datetime and comment to two separate columns applying pandas Series on them, and performing the concatenation on axis=1:

df['extracted'] = (df['comments']
                    .str
                    .findall('(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})(.*?(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}|\Z))',
                            flags=re.DOTALL)
                   )
df  = df.explode('extracted')
df = pd.concat([df, df['extracted'].apply(lambda x: pd.Series(x, index=['datetime',  'comment']))], axis=1)

You can now drop the unwanted columns, and also strip off the new comment column to remove any hyphen - and white spaces at the beginning or the end, and if needed, you can also convert datetime column to pandas datetime type.

OUTPUT:

>>> df[['ticket_num', 'datetime',  'comment']]

  ticket_num             datetime                                                comment
0   ticket_1  2021-04-18 11:11:23                             - Unknown\nPart replaced\n 
0   ticket_1  2021-04-18 09:03:10 - John Doe\nRequires part change. Technician enroute\n 
0   ticket_1  2021-04-17 19:30:25                       - John Doe\nWorking on the issue 
1   ticket_2  2021-04-17 19:13:09          - Unknown_2\nTransferring to other resource\n 
1   ticket_2  2021-04-17 19:12:34                            - Unknown_2\nIssue assigned 

Upvotes: 1

Related Questions