Reputation: 1815
How can I match the below with a pandas extractall regex:
stringwithinmycolumn
stuff, Duration: 15h:22m:33s, notstuff,
stuff, Duration: 18h:22m:33s, notstuff,
Currently, I am using the below:
df.message.str.extractall(r',([^,]*?): ([^,:]*?,').reset_index()
Expected output:
              0            1
match
0      Duration  15h:22m:33s
1      Duration  18h:22m:33s
I am not able to match so far.
Upvotes: 0
Views: 628
Reputation: 626870
You may use
,\s*([^,:]+):\s*([^,]+),
It matches:
, - a comma
\s* - 0+ whitespaces
([^,:]+) - Group 1: one or more chars other than , and :
: - a colon
\s* - 0+ whitespaces
([^,]+) - Group 2: one or more chars other than ,
, - a comma (this can actually be removed, but may stay to ensure safer matching)
Note that you may want to make your regex more precise when you need to extract structured information from long strings. So, you may use a letter-matching pattern to match Duration, and allow only digits, colons, h, m or s when extracting the time value. The pattern then becomes a bit more verbose:
,\s*([A-Za-z]+):\s*([\d:hms]+)
but much safer.
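As a rough check, both patterns from this answer can be run through extractall on the two sample strings from the question. This is only a sketch: the DataFrame construction and the variable names (df, general, strict, out_general, out_strict) are assumptions for illustration, and reset_index(drop=True) is used simply to get a flat 0..n-1 index.

```python
import pandas as pd

# Sample data copied from the question; the column name "message" comes
# from the asker's df.message call.
df = pd.DataFrame({"message": [
    "stuff, Duration: 15h:22m:33s, notstuff,",
    "stuff, Duration: 18h:22m:33s, notstuff,",
]})

# General pattern: any "key: value" pair sitting between two commas.
general = r",\s*([^,:]+):\s*([^,]+),"
# Stricter pattern: letters-only key, value limited to digits, ':', h, m, s.
strict = r",\s*([A-Za-z]+):\s*([\d:hms]+)"

# extractall returns one row per match; drop the (row, match) MultiIndex
# to get a plain positional index.
out_general = df["message"].str.extractall(general).reset_index(drop=True)
out_strict = df["message"].str.extractall(strict).reset_index(drop=True)

print(out_general)
print(out_strict)
```

On this sample both patterns capture the same two pairs. The stricter one should be less likely to pick up unrelated "key: value" fragments in longer messages.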
Upvotes: 1
Reputation: 210842
In [246]: x.message.str.extractall(r',\s*(\w+):\s*([^,]*)').reset_index(level=0, drop=True)
Out[246]:
              0            1
match
0      Duration  15h:22m:33s
0      Duration  18h:22m:33s
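The match index reads 0, 0 here because extractall numbers matches per input row, and each row contains exactly one match. A small sketch of the index handling follows, with the sample DataFrame rebuilt from the question; the variable names raw, by_match and flat are illustrative assumptions.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"message": [
    "stuff, Duration: 15h:22m:33s, notstuff,",
    "stuff, Duration: 18h:22m:33s, notstuff,",
]})

# extractall yields a MultiIndex: (original row label, match number).
raw = df["message"].str.extractall(r",\s*(\w+):\s*([^,]*)")
print(raw.index.tolist())

# Dropping level 0 keeps only the per-row match counter as the index.
by_match = raw.reset_index(level=0, drop=True)
print(by_match)

# A plain reset_index() instead moves both levels into ordinary columns.
flat = raw.reset_index()
print(flat.columns.tolist())
```

If a running 0, 1, ... index is wanted, as in the question's expected output, reset_index(drop=True) on the extractall result is one way to get it.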
Upvotes: 0