Aron Solberg

Reputation: 6878

Python regular expressions acting strangely

import re

url = "http://www.domain.com/7464535"
match = re.search(r'\d*',url)
match.group(0)

returns '' <----- empty string

but

url = "http://www.domain.com/7464535"
match = re.search(r'\d+',url)
match.group(0)

returns '7464535'

I thought '+' was supposed to be 1 or more and '*' was 0 or more, correct? And regular expressions are supposed to be greedy. So why don't they both return the same thing and, more importantly, why does the first one return nothing?

Upvotes: 3

Views: 120

Answers (1)

Scott Olson

Reputation: 3542

You are correct about the meanings of + and *. So \d* will match zero or more digits, and that's exactly what it's doing. Starting at the beginning of the string, it finds zero digits there, and then it's done. It successfully matched zero or more digits.

* is greedy, but that only means that it will match as many digits as it can at the place where it matches. It won't give up a match to try to find a longer one later in the string.
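
To see what "greedy at the place where it matches" means in practice, here is a minimal sketch. The sample string is made up for illustration and is not from the question: it has a short run of digits first and a longer run later.

```python
import re

# Hypothetical sample: a short run of digits first, a longer run later.
text = "12abc34567"

# \d* matches at index 0 and greedily takes both leading digits...
star = re.search(r'\d*', text)
# ...and \d+ does the same; neither skips ahead to the longer run '34567'.
plus = re.search(r'\d+', text)

print(repr(star.group(0)))  # '12'
print(repr(plus.group(0)))  # '12'
```

Both patterns stop at '12' because greediness only applies from the position where the match starts.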


Edit: A more detailed description of what the regex engine does:

Take the case where our string to search is "http://www.domain.com/7464535" and the pattern is \d+.

In the beginning, the regex engine is pointing at the beginning of our URL and the beginning of the regex pattern. \d+ needs to match one or more digits, so first the regex engine must find at least one digit to have a successful match.

The first character it looks at is 'h'. That's not a digit, so it moves on to the 't', then the next 't', and so on until it finally reaches the '7'. Now we've matched one digit, so the "one or more" requirement is satisfied and we could stop with a successful match. But + is greedy, so it matches as many digits as it can without changing the starting point of the match, the '7'. It runs until it hits the end of the string and matches the whole number '7464535'.
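
As a rough check of that walkthrough, the match object can report where the match starts and ends. This sketch uses the URL from the question; the variable names are just for illustration.

```python
import re

url = "http://www.domain.com/7464535"
match = re.search(r'\d+', url)

print(match.start())            # 22, the index of the '7'
print(match.group(0))           # 7464535
print(match.end() == len(url))  # True: the match runs to the end of the string
```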

Now consider the pattern \d*. The only difference is that zero digits is a valid match. Since the regex engine tries positions from left to right, the first place \d* can match is the very start of the string. So we have a zero-length match at the beginning, and since * is greedy, it would extend the match for as long as there are digits. But the first character is 'h', a non-digit, so it just returns the zero-length match.
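
The same kind of check for \d*, again using the URL from the question. The second half is an illustration I'm adding: it uses the optional pos argument of a compiled pattern's search() to start scanning at the '7', which shows that \d* does take the whole number when digits are present at the starting point.

```python
import re

url = "http://www.domain.com/7464535"

# The search succeeds, but the match is empty and sits at index 0.
match = re.search(r'\d*', url)
print(match is not None)  # True
print(match.span())       # (0, 0)

# Illustration only: begin the search at the '7' instead of at index 0.
digits = re.compile(r'\d*')
from_seven = digits.search(url, url.index('7'))
print(from_seven.group(0))  # 7464535
```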

How is * even useful, then, if it will just give you a zero-length match? Suppose I were matching a config file like this:

foo: bar
baz:   quux
blah:blah

I want to allow any number of spaces (even zero) after the colon. I would use a regex like (\w+):\s*(\w+) where \s* matches zero or more whitespace characters. Since it occurs after the colon in the pattern, it will match just after the colon in the string. There it either matches a zero-length string (as in the third line blah:blah, because the 'b' after the colon ends the match) or all the spaces there are before the next non-space, because * is greedy.
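
A small sketch of that pattern applied to the three sample lines. The list and variable names are hypothetical, chosen only for this example.

```python
import re

# The three config lines from the example above.
lines = ["foo: bar", "baz:   quux", "blah:blah"]

# \s* absorbs one space, three spaces, or nothing at all after the colon.
pattern = re.compile(r'(\w+):\s*(\w+)')
pairs = [pattern.match(line).groups() for line in lines]

print(pairs)
# [('foo', 'bar'), ('baz', 'quux'), ('blah', 'blah')]
```

With \s+ instead of \s*, the third line would not match at all, since there is no whitespace after its colon.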

Upvotes: 9
