Reputation: 36
I have following dataframe
data = [
['ticket_1', '''2021-04-18 11:11:23 - Unknown
Part replaced
2021-04-18 09:03:10 - John Doe
Requires part change. Technician enroute
2021-04-17 19:30:25 - John Doe
Working on the issue'''],
['ticket_2','''2021-04-17 19:13:09 - Unknown_2
Transferring to other resource
2021-04-17 19:12:34 - Unknown_2
Issue assigned''']
]
I would like to split the comments column and perform below steps
1. Extract date time and include in another column
2. Have string after date time in another column
3. Duplicate ticket_num based on multiple date time in comments column
For example
data_1 = [
['ticket_1','''2021-04-18 11:11:23''','''Unknown
Part replaced'''],
['ticket_1', '''2021-04-18 09:03:10''', '''John Doe
Requires part change. Technician enroute'''],
['ticket_1','''2021-04-17 19:30:25''', '''John Doe
Working on the issue'''],
['ticket_2','''2021-04-17 19:13:09''','''Unknown_2
Transferring to other resource'''],
['ticket_2','''2021-04-17 19:12:34''','''Unknown_2
Issue assigned''']
]
I have tried using combinations of regex, explode, split and slice but not getting fruitful results.
Upvotes: 0
Views: 95
Reputation: 18406
Assuming the dataframe has following structure:
ticket_num comments
0 ticket_1 2021-04-18 11:11:23 - Unknown\nPart replaced\n2021-04-18 09:03:10 - John Doe\nRequires part change. Technician enroute\n2021-04-17 19:30:25 - John Doe\nWorking on the issue
1 ticket_2 2021-04-17 19:13:09 - Unknown_2\nTransferring to other resource\n2021-04-17 19:12:34 - Unknown_2\nIssue assigned
Start by finding all the occurances of the string values in comments
column on datetime value followed by any character until next datetime value or the end of the multiline string i.e. \Z
, using Series.str.split
, and passing the proper regex. You also need to pass re.DOTALL
flag since it's a multiline string, then explode
the dataframe on this new columns, and convert the value of datetime and comment to two separate columns applying pandas Series on them, and performing the concatenation on axis=1
:
df['extracted'] = (df['comments']
.str
.findall('(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})(.*?(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}|\Z))',
flags=re.DOTALL)
)
df = df.explode('extracted')
df = pd.concat([df, df['extracted'].apply(lambda x: pd.Series(x, index=['datetime', 'comment']))], axis=1)
You can now drop the unwanted columns, and also strip off the new comment
column to remove any hyphen -
and white spaces at the beginning or the end, and if needed, you can also convert datetime
column to pandas datetime type.
OUTPUT:
>>> df[['ticket_num', 'datetime', 'comment']]
ticket_num datetime comment
0 ticket_1 2021-04-18 11:11:23 - Unknown\nPart replaced\n
0 ticket_1 2021-04-18 09:03:10 - John Doe\nRequires part change. Technician enroute\n
0 ticket_1 2021-04-17 19:30:25 - John Doe\nWorking on the issue
1 ticket_2 2021-04-17 19:13:09 - Unknown_2\nTransferring to other resource\n
1 ticket_2 2021-04-17 19:12:34 - Unknown_2\nIssue assigned
Upvotes: 1