jansohn

Reputation: 2326

Non-greedy mode in re.search() does not match to end of string

I'm trying to parse values of a cookie like this:

import re
m = re.search("(.*?)=(.*?); path=(.*?); domain=(.*?)", "name=value1; path=/; domain=my.domain.com")
print(m.group(0))

Result I get is like this:

name=value1; path=/; domain=

My question is: why does it not match at the last non-greedy position? Expected result would be:

name=value1; path=/; domain=my.domain.com

Of course, I could change to greedy mode or use an end of line character ($) but I'd like to understand why it's not working like I expected it to work :)

Upvotes: 0

Views: 792

Answers (3)

Wiktor Stribiżew

Reputation: 626748

Your last (.*?) matches as few characters as possible, and since nothing follows it in the pattern, that means zero characters. To capture the rest of the cookie, either add a lookahead or match the characters you know come next.

Here is a lookahead solution:

(.*?)=(.*?); path=(.*?); domain=(.*?)(?=;\s|$)
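To see the difference in Python, here is a minimal sketch that runs the original pattern and the lookahead version against the cookie string from the question. The variable names (`cookie`, `lazy_only`, `with_lookahead`) are only illustrative.

```python
import re

cookie = "name=value1; path=/; domain=my.domain.com"

# Original pattern: nothing follows the last lazy group, so it stops at zero characters.
lazy_only = re.search(r"(.*?)=(.*?); path=(.*?); domain=(.*?)", cookie)

# Lookahead version: the last group must expand until "; " or end of string follows.
with_lookahead = re.search(r"(.*?)=(.*?); path=(.*?); domain=(.*?)(?=;\s|$)", cookie)

print(repr(lazy_only.group(4)))       # ''
print(repr(with_lookahead.group(4)))  # 'my.domain.com'
```

The lookahead consumes nothing itself, so a following attribute such as "; secure" would not end up in the match.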


BTW, regex101 is very helpful for seeing what happens behind the scenes of a regex: open the regex debugger and click the + on the right, and you'll see exactly what happens when your regex reaches the last (.*?).


So, as I said at the beginning, it matches as few characters as possible. Here it matched an empty string after the = sign; the rest of the input can be "given away" because nothing after the group forces it to consume more, and that is what lazy matching does.

The standard quantifiers in regular expressions are greedy, meaning they match as much as they can, only giving back as necessary to match the remainder of the regex.

By using a lazy quantifier, the expression tries the minimal match first.
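A small illustration of the two behaviours, using a made-up input string (`text` is not from the question):

```python
import re

text = "a=1; b=2; c=3"

# Greedy: .* grabs everything, then backtracks to the LAST ";" it can find.
greedy = re.match(r"(.*);", text).group(1)

# Lazy: .*? starts from nothing and grows only until the FIRST ";" is reachable.
lazy = re.match(r"(.*?);", text).group(1)

print(greedy)  # a=1; b=2
print(lazy)    # a=1
```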

Upvotes: 1

Adam Smith

Reputation: 54173

The other answers do a great job explaining why your code doesn't work as-is. I'll just point out that honestly you should probably be matching non-space characters greedily, rather than matching all characters non-greedily.

re_obj = re.compile(r"""
    (\S*)=(\S*);\s*           # capture unknown key/value pair
    path=(\S*);\s*            # capture path
    domain=(\S*)              # capture domain""", re.X)

DEMO

>>> result = re_obj.search("name=value1; path=/; domain=my.domain.com")
>>> result.groups()
('name', 'value1', '/', 'my.domain.com')

More to the point, this seems easier to do with plain string operations than with a regex:

txt = "name=value1; path=/; domain=my.domain.com"
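# Split into "key=value" parts, then split each part on the first "=" only,
# so a value that itself contains "=" stays intact.
parameters = {key.strip(): value.strip()
              for key, value in (parm.split('=', 1) for parm in txt.split(';'))}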

Upvotes: 0

BrenBarn

Reputation: 251353

Non-greedy means it will match as little as it can while still allowing the entire match to succeed. * means "zero or more". So the least it can match is zero. So it matches zero and the match succeeds.

The other occurrences of .*? in your regex cannot match zero, because then the entire regex will fail to match.
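This can be checked directly with the pattern from the question. The sketch below inspects each group and then uses `re.fullmatch`, one possible way (not the only one) to require that the overall match reach the end of the string, which forces the last lazy group to expand.

```python
import re

cookie = "name=value1; path=/; domain=my.domain.com"
pattern = r"(.*?)=(.*?); path=(.*?); domain=(.*?)"

# The first three groups are followed by literal text they must reach,
# so they cannot stay empty; the last one has nothing after it.
m = re.search(pattern, cookie)
print(m.groups())  # ('name', 'value1', '/', '')

# fullmatch only succeeds if the whole string is consumed,
# so the last (.*?) has to grow to cover the domain.
full = re.fullmatch(pattern, cookie)
print(full.groups())  # ('name', 'value1', '/', 'my.domain.com')
```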

Upvotes: 2
