Reputation: 8895
The following prints ac | a | bbb | c
#!/usr/bin/env perl
use strict;
use warnings;
# use re 'debug';
my $str = 'aacbbbcac';
if ($str =~ m/((a+)?(b+)?(c))*/) {
    print "$1 | $2 | $3 | $4\n";
}
It seems like failed matches do not reset the captured group variables. What am I missing?
Upvotes: 15
Views: 57333
Reputation: 132920
I know that you are looking at this academically, but in general, a quantifier at the end of a match is often a code smell. Not only that, a zero-or-more quantifier is even smellier. That pattern will never fail to match because it can always match the zero-times case:
use strict;
use warnings;
my $str = 'xzy';
if ($str =~ m/((a+)?(b+)?(c))*/) {
    print "matched: $1 | $2 | $3 | $4\n";
}
Gives some warnings, but still matches:
matched: | | |
Use of uninitialized value $1 in concatenation (.) or string at .
Use of uninitialized value $2 in concatenation (.) or string at .
Use of uninitialized value $3 in concatenation (.) or string at .
Use of uninitialized value $4 in concatenation (.) or string at .
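To make the zero-times case visible, here is a minimal sketch that reports where the match starts and how long it is, using Perl's built-in @- and @+ offset arrays. The variable names $start and $length are only illustrative.

```perl
use strict;
use warnings;

my $str = 'xzy';
my ($start, $length);
if ($str =~ m/((a+)?(b+)?(c))*/) {
    # $-[0] and $+[0] hold the start and end offsets of the whole match.
    $start  = $-[0];
    $length = $+[0] - $-[0];
    print "matched $length characters at offset $start\n";
}
```

The match succeeds at offset 0 with a length of 0, which is why the capture variables are all undefined.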
Even if this was changed to + for one-or-more matches, that really means you are usually looking for only the last match. In that case, you should rewrite your pattern to find only the last case and not pollute the per-match variables with previous matches. I don't see a good way to do that in this abstract, contextless situation, but maybe it's a global match in scalar context:
use strict;
use warnings;
my $str = 'aacbbbcac';
while($str =~ m/(a+)?(b+)?(c)/g ) {
    print "$& | $1 | $2 | $3 | $4\n";
}
The output does not carry the same baggage between matches as before, because these are now separate successful matches:
aac | aa | | c |
bbbc | | bbb | c |
ac | a | | c |
Use of uninitialized value $2 in concatenation (.) or string at .
Use of uninitialized value $4 in concatenation (.) or string at .
Use of uninitialized value $1 in concatenation (.) or string at .
Use of uninitialized value $4 in concatenation (.) or string at .
Use of uninitialized value $2 in concatenation (.) or string at .
Use of uninitialized value $4 in concatenation (.) or string at .
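The warnings come from two sources: the pattern in the loop has only three groups, so $4 is always undefined, and an optional group that does not participate in a given match is undefined as well. A possible warning-free variant, assuming Perl 5.10 or later for the defined-or operator (//), could look like the sketch below. The names @rows, $as, $bs and $c are made up for illustration.

```perl
use strict;
use warnings;

my $str = 'aacbbbcac';
my @rows;
while ($str =~ m/(a+)?(b+)?(c)/g) {
    # Substitute an empty string for any group that did not participate.
    my ($as, $bs, $c) = map { $_ // '' } ($1, $2, $3);
    push @rows, "$as | $bs | $c";
}
print "$_\n" for @rows;
```

Each pass through the loop is a separate successful match, so a group that did not participate comes out empty instead of keeping an older value.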
Note that you can't use \G here because the matches overlap.
Upvotes: 0
Reputation: 27252
As odd as it seems this is the "expected" behavior. Here's a quote from the perlre docs:
NOTE: Failed matches in Perl do not reset the match variables, which makes it easier to write code that tests for a series of more specific cases and remembers the best match.
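As a small illustration of that note, the sketch below runs one match that succeeds and then one that fails as a whole, and looks at $1 after each. The strings and variable names are invented for the example.

```perl
use strict;
use warnings;

my $good = 'key=42';
my $bad  = 'key=none';

my $ok_first      = $good =~ /=(\d+)/;   # succeeds and sets $1
my $after_success = $1;

my $ok_second     = $bad =~ /=(\d+)/;    # fails: no digits after '='
my $after_failure = $1;                  # still the value from the earlier success

print "first match ok:  ", ($ok_first  ? 'yes' : 'no'), ", \$1 = $after_success\n";
print "second match ok: ", ($ok_second ? 'yes' : 'no'), ", \$1 = $after_failure\n";
```

After the failed match, $1 still holds 42 from the earlier successful match in the same scope.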
Upvotes: 3
Reputation: 213391
it seems like failed matches dont reset the captured group variables
There are no failed matches here. Your regex matches the string fine, although some inner groups fail to match in some repetitions. On each repetition, a group is either overwritten by the new text it matched or, if it did not match in that repetition, keeps its value from a previous repetition.
Let's see how the regex match proceeds:
First, (a+)?(b+)?(c) matches aac. Since (b+)? is optional and there is no b at this point, it does not participate in the match. At this stage, the capture groups contain the following:
$1 contains the entire match: aac
$2 contains the (a+)? part: aa
$3 contains the (b+)? part: nothing (undef)
$4 contains the (c) part: c
Since there is still some string left to match (bbbcac), the match proceeds: (a+)?(b+)?(c) matches bbbc. Since (a+)? is optional, it is not matched this time.
$1 contains the entire match, bbbc, which overwrites the previous value in $1
$2 doesn't match, so it keeps the text it matched previously: aa
$3 matches this time and contains bbb
$4 matches c
Again, (a+)?(b+)?(c) goes on to match the last part: ac.
$1 contains the entire match: ac
$2 matches a this time, which overwrites its previous value; it now contains a
$3 doesn't match this time, as there is no b in this part; it keeps its value from the previous repetition: bbb
$4 matches c, which overwrites the value from the previous match; it now contains c
Now there is nothing left in the string to match. The final values of the capture groups are:
$1: ac
$2: a
$3: bbb
$4: c
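The walk-through above can be replayed in code. The sketch below is a simulation, not a look inside the regex engine: it runs the group body once per repetition with a global match and applies the "overwrite if matched, otherwise keep the old value" rule by hand. The arrays @kept and @trace are hypothetical names, and the // operator assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

my $str   = 'aacbbbcac';
my @kept  = (undef) x 4;   # simulated $1..$4, carried across repetitions
my @trace;

while ($str =~ m/((a+)?(b+)?(c))/g) {
    my @caps = ($1, $2, $3, $4);
    for my $i (0 .. $#caps) {
        # A group that matched overwrites the old value; otherwise it is kept.
        $kept[$i] = $caps[$i] if defined $caps[$i];
    }
    push @trace, join(' | ', map { $_ // 'undef' } @kept);
}
print "$_\n" for @trace;
# aac | aa | undef | c
# bbbc | aa | bbb | c
# ac | a | bbb | c
```

The last line of the trace is the same as the output reported in the question.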
Upvotes: 23