Juicy

Reputation: 12520

Non-greedy regex not matching as expected

Given the following string as input:

[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|srv=192.168.1.100/80|subj=cli|os=Windows 7 or 8|dist=0|params=none|raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0

I'm trying to match the value of subj; i.e., in the case above the expected output would be cli.

I don't understand why my regex is not working:

subj = re.match(r"(.*)subj=(.*?)|(.*)", line).group(2)

From what I can tell, the second group in here should be cli but I'm getting an empty result.

Upvotes: 1

Views: 111

Answers (5)

nu11p01n73R

Reputation: 26667

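The | has a special meaning in regex: it creates an alternation. Your pattern is therefore read as (.*)subj=(.*?) OR (.*). The first alternative succeeds with the lazy (.*?) matching the empty string, which is why group(2) is empty. Escape the pipe to match it literally:

>>> re.match(r"(.*)subj=(.*?)\|", line).group(2)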
'cli'
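
To see why the original pattern returns an empty string, it can help to compare the unescaped and escaped versions side by side. This is a minimal sketch using the line from the question; the variable names are only illustrative.

```python
import re

line = ("[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|srv=192.168.1.100/80"
        "|subj=cli|os=Windows 7 or 8|dist=0|params=none"
        "|raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0")

# Unescaped pipe: the pattern is parsed as "(.*)subj=(.*?)" OR "(.*)".
unescaped = re.match(r"(.*)subj=(.*?)|(.*)", line)
# The first alternative succeeds; nothing follows the lazy (.*?),
# so it is satisfied by the empty string.
print(repr(unescaped.group(2)))  # ''
# The second alternative never took part in the match.
print(unescaped.group(3))        # None

# Escaped pipe: the lazy group must now expand until a literal | follows.
escaped = re.match(r"(.*)subj=(.*?)\|", line)
print(escaped.group(2))          # cli
```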

Another Solution

With re.search() you can drop the group before subj= and the one after the |, because search scans the string for the first match instead of anchoring at the start.

Example

>>> re.search(r"subj=(.*?)\|", line).group(1)
'cli'

Here we use group(1) because the pattern now contains only one capturing group, instead of the two in the previous version (or the three in the question's pattern).


Complex version

You can even drop the capturing groups entirely by using lookarounds:

>>> re.search(r"(?<=subj=).*?(?=\|)", line).group(0)
'cli'
  • (?<=subj=) Checks that the string matched by .*? is preceded by subj=.

  • .*? Matches anything, non-greedily.

  • (?=\|) Checks that the matched text is followed by a |.
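
Because lookarounds are zero-width, the overall match is only the value itself. The sketch below illustrates this with the lookaround pattern above on a shortened, made-up sample line. It also shows one caveat: the pattern finds nothing when subj is the last field. The relaxed variant is an assumption for illustration, not part of the answer.

```python
import re

# Shortened, made-up sample in the same format as the question's log line.
sample = "mod=syn|subj=cli|os=Windows 7 or 8"
pattern = re.compile(r"(?<=subj=).*?(?=\|)")

m = pattern.search(sample)
print(m.group(0))                        # cli
# The lookbehind text sits just before the match but is not part of it.
print(sample[m.start() - 5:m.start()])   # subj=

# Caveat: if subj is the last field there is no trailing pipe,
# so the lookahead fails and search() returns None.
last_field = "mod=syn|subj=cli"
print(pattern.search(last_field))        # None

# Hypothetical variation: also accept the end of the string.
relaxed = re.compile(r"(?<=subj=).*?(?=\||$)")
print(relaxed.search(last_field).group(0))  # cli
```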

Upvotes: 3

Wiktor Stribiżew

Reputation: 626903

I would use a negated class [^|]* with re.search for better performance:

import re
p = re.compile(r'^(.*)subj=([^|]*)\|(.*)$')
test_str = "[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|srv=192.168.1.100/80|subj=cli|os=Windows 7 or 8|dist=0|params=none|raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0"
# print() with parentheses works in both Python 2 and Python 3
print(p.search(test_str).group(2))

Note that I am not mixing lazy and greedy quantifiers in the same regex, which is usually not advisable.

The pipe symbol must be escaped to be treated as a literal | symbol.

REGEX EXPLANATION:

  • ^ - Start of string
  • (.*) - The first capturing group, which matches all characters from the beginning up to the last occurrence of subj=
  • subj= - A literal string subj=
  • ([^|]*) - The second capturing group matching any characters other than a literal pipe (inside a character class, it does not need escaping)
  • \| - A literal pipe (must be escaped)
  • (.*) - The third capturing group (if you need the rest of the string, up to the end)
  • $ - End of string
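
To make the breakdown above concrete, here is a small sketch that prints what each of the three groups captures. The sample line is a shortened, made-up string in the same format as the question's input.

```python
import re

p = re.compile(r'^(.*)subj=([^|]*)\|(.*)$')

# Shortened, made-up sample line.
sample = "mod=syn|subj=cli|os=Windows 7 or 8"
m = p.search(sample)

print(m.group(1))  # mod=syn|            (everything before subj=)
print(m.group(2))  # cli                 (the value, up to the next pipe)
print(m.group(3))  # os=Windows 7 or 8   (everything after that pipe)
```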

Upvotes: 0

abc123

Reputation: 18783

I'd recommend the following regex, because two changes should give it better performance:

  • anchoring the pattern at the beginning of the line with ^ (re.match already anchors at the start, so this matters mainly with re.search)
  • replacing the lazy group (.*?) with the negated character class [^\|]*, which is faster

Code

subj = re.match(r"^.*\|subj=([^\|]*)", line).group(1)

regex:

^.*\|subj=([^\|]*)
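
The performance claim is easy to check locally. The sketch below first confirms that the lazy and negated-class patterns extract the same value, then times both with timeit. The iteration count is an arbitrary choice, and the timings will vary with the Python version, the machine, and the input, so no figures are shown.

```python
import re
import timeit

line = ("[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|srv=192.168.1.100/80"
        "|subj=cli|os=Windows 7 or 8|dist=0|params=none"
        "|raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0")

lazy = re.compile(r"(.*)subj=(.*?)\|")
negated = re.compile(r"^.*\|subj=([^\|]*)")

# Both patterns should extract the same value.
lazy_value = lazy.match(line).group(2)
negated_value = negated.match(line).group(1)
print(lazy_value, negated_value)

# Rough timing comparison; the numbers are machine-dependent.
lazy_time = timeit.timeit(lambda: lazy.match(line), number=10000)
negated_time = timeit.timeit(lambda: negated.match(line), number=10000)
print("lazy:    %.4f s" % lazy_time)
print("negated: %.4f s" % negated_time)
```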

Upvotes: 2

yvespeirsman

Reputation: 3099

The pipe sign | needs to be escaped, like so:

subj = re.match(r"(.*)subj=(.*?)\|(.*)", s).group(2)

Upvotes: 1

karthik manchala

Reputation: 13640

You need to escape the |. Use the following:

subj = re.match(r"(.*)subj=(.*?)\|(.*)", line).group(2)
                                ^

Upvotes: 2
