Ole Tange

Reputation: 33725

Perl: Method to convert regexp with greedy quantifiers to non-greedy

My user supplies a regexp whose quantifiers are greedy by default. The user can give any valid regexp, so the solution has to deal with anything the user can throw at me.

How do I convert the regexp so any greedy quantifier will be non-greedy?

Does Perl have a (?...:regexp) construct that switches the default for quantifiers from greedy to non-greedy?

If not: Is there a different way I can force a regexp with greedy quantifiers into a non-greedy one?

E.g., a user may enter:

.*
[.*]
[.*]{4,10}
[.*{4,10}]{4,10}

While these four examples may look similar, they have completely different meanings.

If you simply add ? after every * or }, you will change the character classes in the last three examples.

Instead they should be changed to/behave like:

.*?
[.*]
[.*]{4,10}?
[.*{4,10}]{4,10}?

but with the matched string being the minimal (shortest) match, not the leftmost match that Perl defaults to:

$a="aab";

$a=~/(a.*?b)$/;
# Matches aab, not ab
print $1;

But given the non-greedy regexp, the minimal match can probably be obtained by prepending .*:

$a="aab";

$a=~/.*(a.*?b)$/;
# Matches ab
print $1;
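
For comparison, here is a minimal sketch of a brute-force alternative to prepending .*: try the non-greedy pattern at every start offset and keep the shortest result. The helper name shortest_match is hypothetical and not part of Perl. It assumes the pattern's only freedom is how far its quantifiers extend, so patterns with alternations may not yield the true shortest match from a given offset.

```perl
use strict;
use warnings;

# Hypothetical helper: try the pattern at every start offset and keep the
# shortest match found. Returns undef if the pattern never matches.
sub shortest_match {
    my ($re, $str) = @_;
    my $best;
    for my $start (0 .. length $str) {
        pos($str) = $start;
        # \G pins the match to the current pos(), i.e. to $start.
        next unless $str =~ /\G($re)/g;
        $best = $1 if !defined($best) || length($1) < length($best);
    }
    return $best;
}

print shortest_match(qr/a.*?b$/, "aab"), "\n";   # prints "ab"
```

Unlike the .* prefix, which favours the rightmost start position, this compares the lengths of all candidate matches, at the cost of one match attempt per offset.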

Upvotes: 0

Views: 392

Answers (2)

Andy A.

Reputation: 1452

You can use a state machine:

#!/usr/bin/perl

use strict;
use warnings;

my @regexes = ( ".*", "[.*]", "[.*]{4,10}", "[.*{4,10}]{4,10}" );

for (@regexes) {
    print "give: $_\n";
    my $ungreedy = make_ungreedy($_);   # the function takes a single argument
    print "got:  $ungreedy\n";
    print "============================================\n";
}


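# Append "?" to every *, + or {n,m} quantifier that is unescaped and outside
# a character class. Limitations: the bare "?" quantifier (as in "a?") is left
# greedy, and any unescaped "}" outside a class is treated as the end of a
# quantifier, so constructs such as \x{41} or \p{L} are not handled.
sub make_ungreedy {
    my $regex = shift;

    my $class_state  = 0;   # currently inside [...]
    my $escape_state = 0;   # previous character was a backslash
    my $found        = 0;   # previous character ended a quantifier
    my $ungreedy     = "";

    for (split (//, $regex)) {
        if ($found) {
            # Add "?" unless the quantifier is already non-greedy.
            $ungreedy .= "?" unless (/\?/);
            $found = 0;
        }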
        $ungreedy .= $_;

        $escape_state = 0, next if ($escape_state);
        $escape_state = 1, next if (/\\/);
        $class_state  = 1, next if (/\[/);
        if ($class_state) {
            $class_state = 0 if (/\]/);
            next;
        }
        $found = 1 if (/[*}+]/);
    }
    $ungreedy .= '?' if $found;
    return $ungreedy;
}

Upvotes: 4

Alex Shesterov

Reputation: 27565

"Greediness" is not a property of the whole regular expression. It's a property of a quantifier.

It can be controlled for each quantifier separately. Just add a ? after a quantifier to make it non-greedy, e.g.

[a-z]*?

a{2,3}?

[0-9]??

\s+?
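
As a small illustration of the per-quantifier effect, the following sketch matches the same strings with a greedy and a non-greedy version of one quantifier. The sample strings and variable names are made up for the example.

```perl
use strict;
use warnings;

my $text = '<b>one</b> <b>two</b>';

# Greedy: .* extends to the last "</b>" that still lets the match succeed.
my ($greedy) = $text =~ m{<b>(.*)</b>};
# Non-greedy: .*? stops at the first "</b>".
my ($lazy)   = $text =~ m{<b>(.*?)</b>};

# The same applies to counted quantifiers.
my ($greedy_count) = 'aaaa' =~ /(a{2,3})/;
my ($lazy_count)   = 'aaaa' =~ /(a{2,3}?)/;

print "greedy:       $greedy\n";         # one</b> <b>two
print "lazy:         $lazy\n";           # one
print "greedy count: $greedy_count\n";   # aaa
print "lazy count:   $lazy_count\n";     # aa
```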

And no, there isn't any built-in way to switch the whole regex into a "non-greedy by default" mode. You need to parse the regex, detect all quantifiers, and change them accordingly. Maybe there's a regex-parsing library somewhere on CPAN.


The closest I've found so far is the Regexp::Parser module. I haven't tried it, but it looks like it could parse the regex, walk the tree, make the appropriate changes, and then build a modified regex. Please take a look.

Upvotes: 7
