Evgeni Bolotin

Reputation: 33

overlapping pattern matching in Perl

A beginner's question. In the code:

$a = 'aaagggaaa';

(@b) = ($a =~ /(a.+)(g.+)/);

print "$b[0]\n";

Why is $b[0] equal to aaagg and not aaa? In other words, why does the second group, (g.+), match only from the last g?

Upvotes: 0

Views: 131

Answers (5)

Futuregeek

Reputation: 1980

Quantifiers in Perl regular expressions are greedy by default: each one matches as much as it can while still letting the rest of the pattern match.

In your code, the first .+ runs up to the last g that still lets (g.+) match, so the first group is aaagg. To get aaa instead, make that quantifier non-greedy:

$a = 'aaagggaaa';
(@b) = ($a =~ /(a.+?)(g.+)/);
print "$b[0]\n";

It will output:

aaa

The question mark after .+ makes that quantifier non-greedy, so it matches as few characters as possible.

Upvotes: 1

Orabîg

Reputation: 12012

Because the first .+ is "greedy": it tries to match as many characters as possible.
To turn off this greedy behaviour, replace .+ with .+?, so /(a.+?)(g.+)/ returns ('aaa', 'gggaaa').

Maybe you meant to write /(a+)(g+)/ (only 'a's in the first group and only 'g's in the second).

Upvotes: 3

Brad Gilbert

Reputation: 34130

The problem is that the first .+ is causing the g to be matched as far to the right as possible.
To show what is really happening, here is a simplified version of your pattern run with Perl's regex debugger (-Mre=debug), which prints each step of the match.

$ perl -Mre=debug -e'q[aaagggaaa] =~ /a.+[g ]/'
Compiling REx "a.+[g ]"
Final program:
   1: EXACT <a> (3)
   3: PLUS (5)
   4:   REG_ANY (0)
   5: ANYOF[ g][] (16)
  16: END (0)
anchored "a" at 0 (checking anchored) minlen 3 
Guessing start of match in sv for REx "a.+[g ]" against "aaagggaaa"
Found anchored substr "a" at offset 0...
Guessed: match at offset 0
Matching REx "a.+[g ]" against "aaagggaaa"
   0 <> <aaagggaaa>          |  1:EXACT <a>(3)
   1 <a> <aagggaaa>          |  3:PLUS(5)
                                  REG_ANY can match 8 times out of 2147483647...
   9 <aaagggaaa> <>          |  5:  ANYOF[ g][](16)
                                    failed...
   8 <aaagggaa> <a>          |  5:  ANYOF[ g][](16)
                                    failed...
   7 <aaaggga> <aa>          |  5:  ANYOF[ g][](16)
                                    failed...
   6 <aaaggg> <aaa>          |  5:  ANYOF[ g][](16)
                                    failed...
   5 <aaagg> <gaaa>          |  5:  ANYOF[ g][](16)
   6 <aaaggg> <aaa>          | 16:  END(0)
Match successful!
Freeing REx: "a.+[g ]"

Notice that the first .+ is capturing everything it can to start out with.
Then it has to backtrack until the g can be matched.
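
The backtracking in the trace can be imitated by hand. The following is only a rough sketch of the idea, not how the regex engine is implemented: it assumes the leading a has already matched at offset 0, lets .+ take the longest run first, and gives back one character at a time until the next character is a g or a space. The variable names ($str, @tried, $matched) are made up for the illustration.

```perl
use strict;
use warnings;

my $str = 'aaagggaaa';
my @tried;      # offsets at which [g ] was attempted
my $matched;    # overall matched text, once a try succeeds

# "a" consumed offset 0, so .+ starts at offset 1.
# Try the longest .+ first, then shrink it by one character per step.
for (my $len = length($str) - 1; $len >= 1; $len--) {
    my $pos = 1 + $len;                                   # offset just after .+
    push @tried, $pos;
    my $next = $pos < length($str) ? substr($str, $pos, 1) : '';
    if ($next =~ /\A[g ]\z/) {                            # the [g ] step succeeds
        $matched = substr($str, 0, $pos + 1);
        last;
    }
}

print "tried offsets: @tried\n";   # tried offsets: 9 8 7 6 5
print "matched: $matched\n";       # matched: aaaggg
```

The offsets tried (9, 8, 7, 6, 5) are the same ones shown in the left column of the debug output.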


What you probably want is one of:

/( a+     )( g+  )/x;
/( a.+?   )( g.+ )/x;
/( a+     )( g.+ )/x;
/( a[^g]+ )( g.+ )/x;
/( a[^g]+ )( g+  )/x;
# etc.

Without more information from you, it is impossible to know which regex you want.
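
To see how the candidates differ, one option is to run each of them against the string from the question and print what the two groups capture. This is a small sketch; the names $str, @patterns and @results are chosen here for illustration.

```perl
use strict;
use warnings;

my $str = 'aaagggaaa';

# The five candidate patterns listed above, in the same order.
my @patterns = (
    qr/( a+     )( g+  )/x,
    qr/( a.+?   )( g.+ )/x,
    qr/( a+     )( g.+ )/x,
    qr/( a[^g]+ )( g.+ )/x,
    qr/( a[^g]+ )( g+  )/x,
);

my @results;
for my $re (@patterns) {
    my @caps = $str =~ $re;    # list context returns the captured groups
    push @results, "@caps";
    print "@caps\n";
}
# Prints:
# aaa ggg
# aaa gggaaa
# aaa gggaaa
# aaa gggaaa
# aaa ggg
```

On this particular input several patterns give the same captures; they only diverge on other strings, which is why the intended input matters.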

Regular expressions are really a language in their own right, one that is more complicated than the rest of Perl.

Upvotes: 1

TrueY

Reputation: 7610

Regex quantifiers are greedy by default. You can turn that off by appending ? to the quantifier:

$a = 'aaagggaaa';
my @b = ($a =~ /(a.+)(g.+)/);
my @c = ($a =~ /(a.+?)(g.+)/);
print "@b\n";
print "@c\n";

Output:

aaagg gaaa
aaa gggaaa

But I'm not sure this is what you want. What about abagggbb? Do you need aba?
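
As a quick check of that case, here is a sketch comparing the non-greedy pattern with /(a+)(g+)/ (a pattern suggested elsewhere in this thread) on abagggbb. The variable names are made up for the example.

```perl
use strict;
use warnings;

my $str = 'abagggbb';

# Non-greedy .+? still crosses the "b", because . matches any character.
my @lazy = $str =~ /(a.+?)(g.+)/;

# a+ and g+ only accept runs of "a" followed directly by runs of "g".
my @runs = $str =~ /(a+)(g+)/;

print "@lazy\n";   # aba gggbb
print "@runs\n";   # a ggg
```

So the non-greedy version does return aba here, while the stricter pattern skips ahead to the a that directly precedes the g's.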

Upvotes: 0

Miguel Prz

Reputation: 13792

The regular expression you wrote:

($a =~ /(a.+)(g.+)/);

captures an "a" followed by as many characters as it can, as long as a "g" followed by at least one more character remains. So the first group (a.+) matches "aaagg", leaving the second group (g.+) to match "gaaa".

The @b array receives the two captures, "aaagg" and "gaaa", so $b[0] is "aaagg".
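
If it helps to see exactly where each group starts and ends, Perl's built-in @- and @+ arrays hold the start and end offsets of every capture after a successful match. A minimal sketch, with $str and @spans as illustrative names:

```perl
use strict;
use warnings;

my $str = 'aaagggaaa';
my @spans;

if ($str =~ /(a.+)(g.+)/) {
    for my $n (1, 2) {
        # $-[$n] is the start offset of group $n, $+[$n] is one past its end.
        my ($from, $to) = ($-[$n], $+[$n]);
        push @spans, sprintf('group %d: offsets %d-%d "%s"',
                             $n, $from, $to, substr($str, $from, $to - $from));
    }
}

print "$_\n" for @spans;
# group 1: offsets 0-5 "aaagg"
# group 2: offsets 5-9 "gaaa"
```

The second group begins at offset 5, which is the last g in the string.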

Upvotes: 1
