Reputation: 2326
I'm trying to parse values of a cookie like this:
import re
m = re.search("(.*?)=(.*?); path=(.*?); domain=(.*?)", "name=value1; path=/; domain=my.domain.com")
print (m.group(0))
Result I get is like this:
name=value1; path=/; domain=
My question is: why does it not match at the last non-greedy position? Expected result would be:
name=value1; path=/; domain=my.domain.com
Of course, I could switch to greedy mode or add an end-of-line anchor ($), but I'd like to understand why it doesn't work the way I expected :)
Upvotes: 0
Views: 792
Reputation: 626748
Your last (.*?) matches as few characters as possible, and since nothing follows it in the pattern, zero characters is enough. To capture the rest of the cookie, you must add a lookahead or match the characters known to follow.
Here is a lookahead solution:
(.*?)=(.*?); path=(.*?); domain=(.*?)(?=;\s|$)
See demo
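To try the lookahead version in Python, here is a minimal sketch that runs the original pattern and the lookahead pattern against the cookie string from the question. The variable names (`cookie`, `lazy`, `anchored`) are only illustrative.

```python
import re

cookie = "name=value1; path=/; domain=my.domain.com"

# Original pattern: nothing follows the last lazy group,
# so it is free to match zero characters.
lazy = re.search(r"(.*?)=(.*?); path=(.*?); domain=(.*?)", cookie)

# Same pattern plus a lookahead that requires "; " or the end of the string
# right after the domain, so the last group has to expand to reach it.
anchored = re.search(r"(.*?)=(.*?); path=(.*?); domain=(.*?)(?=;\s|$)", cookie)

print(repr(lazy.group(4)))   # ''
print(anchored.groups())     # ('name', 'value1', '/', 'my.domain.com')
```

The lookahead consumes nothing itself. It only gives the last group a condition that an empty match cannot satisfy.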
BTW, regex101 is very helpful for seeing what happens behind the scenes of a regex: go to the regex debugger and click the + on the right, and you'll see exactly what happens when your regex reaches the last (.*?).
So, that is what I said in the beginning: matching as few as possible. It matched an empty string after the = sign; the rest can be "given away", since this is what lazy matching does.
The standard quantifiers in regular expressions are greedy, meaning they match as much as they can, only giving back as necessary to match the remainder of the regex.
By using a lazy quantifier, the expression tries the minimal match first.
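A small sketch of the difference, using a made-up cookie-style string (the `text` value and variable names are assumptions for illustration only):

```python
import re

text = "a=1; b=2; c=3"

# Greedy: .* grabs everything, then gives back just enough
# for the trailing "; " to match (the last "; " in the string).
greedy = re.search(r"(.*); ", text).group(1)

# Lazy: .*? starts with nothing and grows only until
# the trailing "; " can match (the first "; " in the string).
lazy = re.search(r"(.*?); ", text).group(1)

print(greedy)  # a=1; b=2
print(lazy)    # a=1
```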
Upvotes: 1
Reputation: 54173
The other answers do a great job explaining why your code doesn't work as-is. I'll just point out that honestly you should probably be matching non-space characters greedily, rather than matching all characters non-greedily.
re_obj = re.compile(r"""
(\S*)=(\S*);\s* # capture unknown key/value pair
path=(\S*);\s* # capture path
domain=(\S*) # capture domain""", re.X)
DEMO
>>> result = re_obj.search("name=value1; path=/; domain=my.domain.com")
>>> result.groups()
('name', 'value1', '/', 'my.domain.com')
And even more to the point, this seems easier to do with string operations than anything else:
txt = "name=value1; path=/; domain=my.domain.com"
# Split on ';' into "key=value" parts, then split each part on the first '=' only
parameters = {key.strip(): value.strip()
              for parm in txt.split(';')
              for key, value in (parm.split('=', 1),)}
Upvotes: 0
Reputation: 251353
Non-greedy means it will match as little as it can while still allowing the entire match to succeed. * means "zero or more", so the least it can match is zero characters. So it matches zero and the match succeeds.
The other occurrences of .*? in your regex cannot match zero characters, because then the entire regex would fail to match.
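This can be checked directly in Python. The sketch below prints the captured groups for the pattern from the question, then reruns it with `re.fullmatch`, which only accepts a match that spans the whole string. Using `fullmatch` here is just one way to illustrate the point, not something the question used.

```python
import re

cookie = "name=value1; path=/; domain=my.domain.com"
pattern = r"(.*?)=(.*?); path=(.*?); domain=(.*?)"

# search(): the first three lazy groups must grow until the literal text
# after them can match; the last one has nothing after it and stays empty.
m = re.search(pattern, cookie)
print(m.groups())     # ('name', 'value1', '/', '')

# fullmatch(): an empty last group would leave text unmatched, so the
# overall match would fail; the lazy group is forced to expand to the end.
full = re.fullmatch(pattern, cookie)
print(full.groups())  # ('name', 'value1', '/', 'my.domain.com')
```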
Upvotes: 2