Josh Klein

Reputation: 305

Why is this regex not greedy?

In this regex

$line = 'this is a regular expression';
$line =~  s/^(\w+)\b(.*)\b(\w+)$/$3 $2 $1/;

print $line;

Why is $2 equal to " is a regular "? I expected (.*) to be greedy and match every character up to the end of the line, which would leave $3 empty.

That's not happening, though. The regex matcher somehow stops right before the last word boundary: $3 gets whatever comes after that boundary, and the rest of the string goes to $2.

Any explanation? Thanks.

Upvotes: 9

Views: 716

Answers (4)

verdesmarald

Reputation: 11866

$3 can't be empty when using this regex because the corresponding capturing group is (\w+), which must match at least one word character or the whole match will fail.

So what happens is that (.*) first matches " is a regular expression" (everything after "this"), \b matches at the end of the string, and (\w+) fails because nothing is left for it to match. The regex engine then backtracks, making (.*) give up one character at a time, until (.*) matches " is a regular " (note that the match includes the spaces), \b matches the word boundary before e, and (\w+) matches "expression".

If you change (\w+) to (\w*), then you will end up with the result you expected, where (.*) consumes the rest of the string and $3 is empty.
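A minimal sketch of that difference, reusing the string from the question. The helper name `captures` is made up for illustration:

```perl
use strict;
use warnings;

my $line = 'this is a regular expression';

# Hypothetical helper: return the capture groups as a list
# (an empty list if the pattern does not match).
sub captures {
    my ($re) = @_;
    return $line =~ $re;
}

my @plus = captures(qr/^(\w+)\b(.*)\b(\w+)$/);   # (\w+) needs at least one character
my @star = captures(qr/^(\w+)\b(.*)\b(\w*)$/);   # (\w*) is allowed to match nothing

print join('|', map { "[$_]" } @plus), "\n";   # [this]|[ is a regular ]|[expression]
print join('|', map { "[$_]" } @star), "\n";   # [this]|[ is a regular expression]|[]
```

With (\w+), the greedy (.*) has to give characters back; with (\w*), it keeps them all.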

Upvotes: 15

Brad Gilbert

Reputation: 34120

The way you wrote your regexp, it doesn't matter whether .* is greedy or non-greedy. It will still match, and with the same captures.

The reason is that you used \b between .* and \w+.

use strict;
use warnings;

my $string = 'this is a regular expression';

sub test{
  my($match,$desc) = @_;
  print '# ', $desc, "\n" if $desc;
  print "test( qr'$match' );\n";
  if( my @elem = $string =~ $match ){
    print ' 'x4,'[\'', join("']['",@elem), "']\n\n"
  }else{
    print ' 'x4,"FAIL\n\n";
  }
}

test( qr'^ (\w+) \b (.*)  \b (\w+) $'x, 'original' );
test( qr'^ (\w+) \b (.*+) \b (\w+) $'x, 'extra-greedy' );
test( qr'^ (\w+) \b (.*?) \b (\w+) $'x, 'non-greedy' );
test( qr'^ (\w+) \b (.*)  \b (\w*) $'x, '\w* instead of \w+' );
test( qr'^ (\w+) \b (.*)     (\w+) $'x, 'no \b');
test( qr'^ (\w+) \b (.*?)    (\w+) $'x, 'no \b, non-greedy .*?' );

Output:

# original
test( qr'(?^x:^ (\w+) \b (.*)  \b (\w+) $)' );
    ['this'][' is a regular ']['expression']

# extra-greedy
test( qr'(?^x:^ (\w+) \b (.*+) \b (\w+) $)' );
    FAIL

# non-greedy
test( qr'(?^x:^ (\w+) \b (.*?) \b (\w+) $)' );
    ['this'][' is a regular ']['expression']

# \w* instead of \w+
test( qr'(?^x:^ (\w+) \b (.*)  \b (\w*) $)' );
    ['this'][' is a regular expression']['']

# no \b
test( qr'(?^x:^ (\w+) \b (.*)     (\w+) $)' );
    ['this'][' is a regular expressio']['n']

# no \b, non-greedy .*?
test( qr'(?^x:^ (\w+) \b (.*?)    (\w+) $)' );
    ['this'][' is a regular ']['expression']

Upvotes: 0

Kevin

Reputation: 56119

In order for the regex to match the whole string, ^(\w+)\b requires that the entire first word be captured in $1. Likewise, \b(\w+)$ requires that the entire last word be captured in $3. Therefore, no matter how greedy (.*) is, it can only capture ' is a regular '; otherwise the pattern won't match. At some point while matching the string, .* probably did take up the entire ' is a regular expression', but the engine then had to backtrack so that \w+ could get its match too.
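The backtracking described above can be imitated by hand. This is only a rough sketch, not how the Perl engine is actually implemented: it tries the longest candidate for (.*) first and gives back one character per attempt until \b(\w+)$ matches at the split point. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my $line = 'this is a regular expression';

# The first word is fixed by ^(\w+)\b.
my ($first) = $line =~ /^(\w+)\b/;
my $rest = substr($line, length $first);   # what (.*) could take at most

my ($middle, $word, $tries) = (undef, undef, 0);

# Longest candidate for (.*) first, then one character shorter each time.
for (my $len = length $rest; $len >= 0; $len--) {
    $tries++;
    # \G anchors the match at pos(), so \b sees the real neighbouring characters.
    pos($line) = length($first) + $len;
    if ($line =~ /\G\b(\w+)$/gc) {
        $middle = substr($rest, 0, $len);
        $word   = $1;
        last;
    }
}

print "attempts: $tries\n";
print "middle:   '$middle'\n";
print "last:     '$word'\n";
```

The first attempt, with (.*) holding the whole remainder, fails because nothing is left for (\w+). The loop stops at the first split that sits on a word boundary and leaves only word characters up to the end of the string.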

Upvotes: 1

daniel gratzer

Reputation: 53891

Greedy doesn't mean it gets to match absolutely everything. It just means it takes as much as it can while still letting the regex as a whole succeed.
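A small sketch of that rule on a made-up string (not from the question), comparing a greedy and a non-greedy quantifier in the same position:

```perl
use strict;
use warnings;

my $csv = 'a,b,c';   # made-up sample string

# Greedy: the first (.*) takes as much as it can while still
# leaving a comma for the rest of the pattern to match.
my @greedy = $csv =~ /^(.*),(.*)$/;

# Non-greedy: the first (.*?) takes as little as it can.
my @lazy = $csv =~ /^(.*?),(.*)$/;

print "greedy: '$greedy[0]' '$greedy[1]'\n";   # greedy: 'a,b' 'c'
print "lazy:   '$lazy[0]' '$lazy[1]'\n";       # lazy:   'a' 'b,c'
```

Both patterns match the whole string; greediness only decides which of the possible splits is chosen.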

Since group 3 uses +, which means one or more, it can't be empty if the regex is to succeed.

If you want $3 to be empty, just change (\w+) to (\w?). Since ? means 0 or 1, the group can be empty, and therefore the greedy .* takes everything. Note: This seems to work only in Perl, due to how Perl deals with lines.

Upvotes: 6
