Reputation: 33
A beginner's question. In the code:
$a = 'aaagggaaa';
(@b) = ($a =~ /(a.+)(g.+)/);
print "$b[0]\n";
Why is $b[0] equal to aaagg and not aaa? In other words, why does the second group, (g.+), match only from the last g?
Upvotes: 0
Views: 131
Reputation: 1980
Perl quantifiers are greedy by default: each one matches as much as it can while still allowing the rest of the pattern to match.
In your code the first .+ runs up to the last g, so the first group captures aaagg. To get aaa instead, make that quantifier non-greedy:
$a = 'aaagggaaa';
(@b) = ($a =~ /(a.+?)(g.+)/);
print "$b[0]\n";
It will output:
aaa
The question mark after the quantifier makes it non-greedy, so it matches as few characters as possible.
Upvotes: 1
Reputation: 12012
Because the first .+ is "greedy", which means it will try to match as many characters as possible.
To turn off this greedy behaviour, replace .+ with .+?, so /(a.+?)(g.+)/ will return ('aaa', 'gggaaa').
Perhaps you meant to write /(a+)(g+)/ (only 'a's in the first group and only 'g's in the second).
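To make the difference between the two suggestions concrete, here is a small sketch that runs both patterns against the string from the question. The variable names ($str, @lazy, @exact) are only for illustration and do not come from the original code.

```perl
use strict;
use warnings;

my $str = 'aaagggaaa';

# Non-greedy first group: it stops as soon as a "g" can follow.
# The second group is still greedy and takes the rest of the string.
my @lazy = $str =~ /(a.+?)(g.+)/;

# Character-specific groups: only "a"s, then only "g"s.
my @exact = $str =~ /(a+)(g+)/;

print "lazy:  @lazy\n";     # lazy:  aaa gggaaa
print "exact: @exact\n";    # exact: aaa ggg
```

Both give aaa for the first group on this input, but they differ in the second group, so which one fits depends on what the second capture is supposed to contain.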
Upvotes: 3
Reputation: 34130
The problem is that the first .+ is greedy, which causes the g to be matched as far to the right as possible.
To show you what is really happening, I modified your code to print debug information.
$ perl -Mre=debug -e'q[aaagggaaa] =~ /a.+[g ]/'
Compiling REx "a.+[g ]"
Final program:
1: EXACT <a> (3)
3: PLUS (5)
4: REG_ANY (0)
5: ANYOF[ g][] (16)
16: END (0)
anchored "a" at 0 (checking anchored) minlen 3
Guessing start of match in sv for REx "a.+[g ]" against "aaagggaaa"
Found anchored substr "a" at offset 0...
Guessed: match at offset 0
Matching REx "a.+[g ]" against "aaagggaaa"
0 <> <aaagggaaa> | 1:EXACT <a>(3)
1 <a> <aagggaaa> | 3:PLUS(5)
REG_ANY can match 8 times out of 2147483647...
9 <aaagggaaa> <> | 5: ANYOF[ g][](16)
failed...
8 <aaagggaa> <a> | 5: ANYOF[ g][](16)
failed...
7 <aaaggga> <aa> | 5: ANYOF[ g][](16)
failed...
6 <aaaggg> <aaa> | 5: ANYOF[ g][](16)
failed...
5 <aaagg> <gaaa> | 5: ANYOF[ g][](16)
6 <aaaggg> <aaa> | 16: END(0)
Match successful!
Freeing REx: "a.+[g ]"
Notice that the first .+ grabs everything it can to start with.
Then it has to backtrack, one character at a time, until the g can be matched.
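The same effect can be seen without the debug output by looking at where each capture group starts and ends. This is a minimal sketch using the original two-group regex from the question together with Perl's built-in @- and @+ offset arrays; the variable names are only illustrative.

```perl
use strict;
use warnings;

my $str = 'aaagggaaa';

my ($first_end, $second_start);
if ($str =~ /(a.+)(g.+)/) {
    # $+[1] is the offset just past group 1; $-[2] is where group 2 starts.
    $first_end    = $+[1];
    $second_start = $-[2];
    printf "group 1: %s (ends at offset %d)\n",   $1, $first_end;     # group 1: aaagg (ends at offset 5)
    printf "group 2: %s (starts at offset %d)\n", $2, $second_start;  # group 2: gaaa (starts at offset 5)
}
```

Offset 5 is the last g in the string: the greedy .+ only gave back as many characters as were needed for (g.+) to succeed.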
What you probably want is one of:
/( a+ )( g+ )/x;
/( a.+? )( g.+ )/x;
/( a+ )( g.+ )/x;
/( a[^g]+ )( g.+ )/x;
/( a[^g]+ )( g+ )/x;
# etc.
Without more information from you, it is impossible to know which regex you want.
Regular expressions really are a language in their own right, and one that is more complicated than the rest of Perl.
Upvotes: 1
Reputation: 7610
Regex quantifiers are greedy by default. You can turn this off by appending a ? to the quantifier:
$a = 'aaagggaaa';
my @b = ($a =~ /(a.+)(g.+)/);
my @c = ($a =~ /(a.+?)(g.+)/);
print "@b\n";
print "@c\n";
Output:
aaagg gaaa
aaa gggaaa
But I'm not sure this is what you want! What about abagggbb? Do you need aba?
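To illustrate why the input matters, here is a rough sketch that tries three candidate patterns on abagggbb. The second and third patterns are not from this answer; they are assumptions about what might be wanted, included only for comparison.

```perl
use strict;
use warnings;

my $str = 'abagggbb';

my @lazy    = $str =~ /(a.+?)(g.+)/;     # first group may contain any characters
my @negated = $str =~ /(a[^g]+)(g.+)/;   # first group may contain anything except "g"
my @exact   = $str =~ /(a+)(g+)/;        # first group may contain only "a"

print "lazy:    @lazy\n";       # lazy:    aba gggbb
print "negated: @negated\n";    # negated: aba gggbb
print "exact:   @exact\n";      # exact:   a ggg
```

The first two capture aba, while the a+ version skips ahead and captures only the single a directly before the g's.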
Upvotes: 0
Reputation: 13792
The regular expression you wrote:
($a =~ /(a.+)(g.+)/);
matches an "a" followed by as many characters as it can, while still leaving one "g" followed by at least one more character. So the first group, (a.+), matches "aaagg", and the second group, (g.+), matches "gaaa".
The @b array receives the two captures, "aaagg" and "gaaa", so printing $b[0] gives "aaagg".
Upvotes: 1