Reputation: 89
While writing a regex pattern to substitute every run of consecutive '1's (including a single '1') with '*', I ran into something confusing: using '+' (which matches 1 or more) gave the expected result, but '*' gave a strange one.
>>> l='100'
>>> import re
>>> j=re.compile(r'(1)*')
>>> m=j.sub('*',l)
>>> m
'*0*0*'
Using '+' gave the expected result:
>>> l='100'
>>> j=re.compile(r'1+')
>>> m=j.sub('*',l)
>>> m
'*00'
Why does '*' produce this output, when its behaviour is to match 0 or more?
Upvotes: 6
Views: 190
Reputation: 8833
(1)* means "match 0 or more 1's". So for 100 it matches the 1, the empty string between 0 and 0, and the empty string after the last 0. Every match, including the empty ones, is then replaced with '*'. 1+ requires at least one 1 in the match, so it can never match the empty string at a boundary between characters.
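To see exactly which matches get replaced, one option is to pass a callback to re.sub() and record each span. This is only a minimal sketch: the helper name replaced_spans is made up for illustration, and the number of zero-length matches reported for (1)* can differ between Python versions, so no fixed output is shown.

```python
import re

def replaced_spans(pattern, text):
    """Return the (start, end) span of every match that re.sub() replaces."""
    spans = []

    def record(match):
        # Called once per replaced match; remember where it was.
        spans.append(match.span())
        return '*'

    re.sub(pattern, record, text)
    return spans

# 1+ can only match a non-empty run of 1s.
print(replaced_spans(r'1+', '100'))
# (1)* matches the run of 1s and also zero-length positions (start == end).
print(replaced_spans(r'(1)*', '100'))
```

Any span where start equals end is an empty match, which is where the extra '*' characters come from.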
For curious readers: yes, the Python output is *0*0* and not **0*0* (on the Python versions available when this was written; since Python 3.7, re.sub() also replaces an empty match adjacent to a previous non-empty match, so newer interpreters return **0*0*). Here is a test python script to play with. (Regex101 has the wrong output for this, because it does not use an actual Python regex engine. Online regex testers usually use PCRE, which is provided in PHP and Apache HTTP Server, and fake the target regex engine. Always test your regex in live code!)
Here you can see that in JavaScript the output is **0*0* (the empty string between 1 and 0 is matched as a new match). This is a prime example of why 'regex flavor' is important: different engines use slightly different rules (in this case, whether a new empty match may start right where the previous match ended).
console.log("100".replace(/(1)*/g, '*'))
Upvotes: 6
Reputation: 48711
import re

regex = r"1*"
p = re.compile(regex)
test_str = "100"
# Print the start offset and text of every match, including empty ones
for m in p.finditer(test_str):
    print(m.start(), m.group())
Outputs 4 matches (that's why regex101 shows 4 matches):
0 1
1
2
3
re.sub(), however, replaces only 3 positions, because of how it advances after a zero-length match (Python doc):
sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.
...
Empty matches for the pattern are replaced only when not adjacent to a previous match, so sub('x*', '-', 'abc') returns '-a-b-c-'.
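A quick way to compare the two counts on your own interpreter is to put finditer() next to subn(), which returns the number of substitutions it made. This is just a sketch; the substitution count depends on the Python version (older releases skip an empty match adjacent to the previous match, Python 3.7 and later replace it), so the expected output is deliberately left out.

```python
import re

pattern = re.compile(r'1*')
text = '100'

# Every match the engine reports, including zero-length ones.
found = [m.span() for m in pattern.finditer(text)]

# subn() returns (new_string, number_of_substitutions_made).
result, count = pattern.subn('*', text)

print(len(found), found)
print(count, result)
```

If the two numbers differ, your version of re.sub() is skipping an empty match; if they agree, it is not.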
What does a non-overlapping occurrence mean? It matters when:
the first match ended at the start of the string, where the first match attempt began. The regex engine needs a way to avoid getting stuck in an infinite loop that forever finds the same zero-length match at the start of the string.
The simplest solution, which is used by most regex engines, is to start the next match attempt one character after the end of the previous match, if the previous match was zero-length.
In this case, the second match attempt begins at the position between the 1 and the 0 in the string, hence the difference.
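The two policies can be mimicked with a small hand-written substitution loop. This is only an illustrative sketch, not how the re module is implemented: the function sub_all and its allow_empty_after_match switch are hypothetical names, and it ignores details such as anchors, lookbehind and group references in the replacement.

```python
import re

def sub_all(pattern, repl, text, allow_empty_after_match):
    """Replace matches left to right with a selectable empty-match policy."""
    regex = re.compile(pattern)
    out = []
    pos = 0
    prev_end = -1  # end of the last non-empty match; -1 means none yet
    while pos <= len(text):
        m = regex.match(text, pos)  # try to match exactly at pos
        if m is None:
            # No match here: copy one character and try the next position.
            out.append(text[pos:pos + 1])
            pos += 1
        elif m.end() == m.start():
            # Zero-length match: replace it unless the policy forbids an
            # empty match right where the previous match ended.
            if allow_empty_after_match or m.start() != prev_end:
                out.append(repl)
            # Step one character forward so the loop cannot get stuck.
            out.append(text[pos:pos + 1])
            pos += 1
        else:
            # Non-empty match: replace it and continue after it.
            out.append(repl)
            pos = prev_end = m.end()
    return ''.join(out)

print(sub_all(r'1*', '*', '100', allow_empty_after_match=False))  # *0*0*
print(sub_all(r'1*', '*', '100', allow_empty_after_match=True))   # **0*0*
```

With the switch off you get the result from the question; with it on you get the result reported for JavaScript and Perl.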
Upvotes: 2
Reputation: 385655
Beware of patterns that can match nothing. This isn't well-defined, so the behaviour varies by engine. For example, you get a different result in Perl.
$ perl -e'CORE::say "100" =~ s/1*/\*/rg'
**0*0*
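If you want a rough check for whether a pattern falls into this category, you can test it against the empty string. This is a heuristic sketch only (the helper name can_match_empty is made up here): a pattern using lookarounds may match an empty position inside a longer string even though it fails on '' by itself.

```python
import re

def can_match_empty(pattern):
    """Rough check: does the pattern match the empty string?"""
    return re.fullmatch(pattern, '') is not None

print(can_match_empty(r'1*'))    # True
print(can_match_empty(r'(1)*'))  # True
print(can_match_empty(r'1+'))    # False
```

When the check comes back True and you do not actually want empty matches, switching to a quantifier such as + sidesteps the engine-specific behaviour.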
Upvotes: 2