johnnyb

Reputation: 1815

pandas extractall matching

How can I match the strings below with a regex in pandas extractall?

stringwithinmycolumn
stuff, Duration: 15h:22m:33s, notstuff,
stuff, Duration: 18h:22m:33s, notstuff,

Currently, I am using the below:

df.message.str.extractall(r',([^,]*?): ([^,:]*?,').reset_index()

Expected output:

              0              1
match    
    0  Duration    15h:22m:33s
    1  Duration    18h:22m:33s

I am not able to match so far.

Upvotes: 0

Views: 628

Answers (2)

Wiktor Stribiżew

Reputation: 626870

You may use

,\s*([^,:]+):\s*([^,]+),

See the regex demo

It matches:

  • , - a comma
  • \s* - 0+ whitespaces
  • ([^,:]+) - Group 1: one or more chars other than , and :
  • : - a colon
  • \s* - 0+ whitespaces
  • ([^,]+) - Group 2: one or more chars other than ,
  • , - a comma (this actually can be removed, but may stay to ensure safer matching.)
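As a quick check, here is a minimal sketch of that pattern applied with extractall. The DataFrame below is an assumption built from the two sample strings in the question; the column name message is taken from the question's code.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed to live in a "message" column).
df = pd.DataFrame({
    "message": [
        "stuff, Duration: 15h:22m:33s, notstuff,",
        "stuff, Duration: 18h:22m:33s, notstuff,",
    ]
})

# Group 1 captures the key, Group 2 captures the value up to the next comma.
pattern = r",\s*([^,:]+):\s*([^,]+),"

# extractall returns one row per match, indexed by (original row, match number).
result = df["message"].str.extractall(pattern)
print(result)
```

Each row yields a single match, so column 0 holds Duration and column 1 holds the time value. Chain .reset_index() afterwards if you want the index levels as ordinary columns.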

Note that when you extract structured information from long strings, it is worth making the regex more precise. Here, you can use a letter-only pattern to match Duration, and allow only digits, colons, h, m, or s in the time value. The pattern becomes a bit more verbose:

,\s*([A-Za-z]+):\s*([\d:hms]+)

but much safer. See another regex demo.
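To illustrate the difference, here is a small sketch comparing the two patterns with the standard re module. The message string is hypothetical: it adds a Level: INFO field that is not in the question, just to show how each pattern reacts to an unrelated key/value pair.

```python
import re

loose = re.compile(r",\s*([^,:]+):\s*([^,]+),")
strict = re.compile(r",\s*([A-Za-z]+):\s*([\d:hms]+)")

# Hypothetical input: an extra "Level: INFO" field precedes the duration.
message = "stuff, Level: INFO, Duration: 15h:22m:33s, notstuff,"

# findall returns one (group 1, group 2) tuple per match.
loose_matches = loose.findall(message)
strict_matches = strict.findall(message)

print("loose: ", loose_matches)
print("strict:", strict_matches)
```

On this input the loose pattern picks up Level / INFO and then misses Duration, because its trailing comma consumes the separator the next match would need. The strict pattern skips Level (INFO is not made of digits, colons, h, m or s) and returns only the duration.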

Upvotes: 1

MaxU - stand with Ukraine

Reputation: 210842

In [246]: x.message.str.extractall(r',\s*(\w+):\s*([^,]*)').reset_index(level=0, drop=True)
Out[246]:
              0            1
match
0      Duration  15h:22m:33s
0      Duration  18h:22m:33s

Upvotes: 0
