Krab

Reputation: 6756

Perl 5 - longest token matching in regexp (using alternation)

Is it possible to force a Perl 5 regexp to match the longest possible string when the regexp is, for example:

a|aa|aaa

I found that this is probably the default in Perl 6, but how can I get this behavior in Perl 5?

EXAMPLE pattern:

[0-9]|[0-9][0-9]|[0-9][0-9][0-9][0-9]

If I have the string 2.10.2014, then the first match will be 2, which is OK; but the next match will be 1, and this is not OK because it should be 10. Then 2014 will be matched as four separate matches, 2, 0, 1, 4, but it should be matched as 2014 by [0-9][0-9][0-9][0-9]. I know I could use [0-9]+, but I can't.

Upvotes: 3

Views: 2537

Answers (4)

Ruud H.G. van Tol

Reputation: 1

perl -Mstrict -Mre=/xp -MData::Dumper -wE'
  {package Data::Dumper;our($Indent,$Sortkeys,$Terse,$Useqq)=(1)x4}
  sub _dump { Dumper(shift) =~ s{(\[.*?\])}{$1=~s/\s+/ /gr}srge }
  my ($count, %RS);
  my $s= "aaaabbaaaaabbab";
  $s =~ m{ \G a+b? (?{ $RS{ $+[0] - $-[0] } //= [ ${^MATCH}, $-[0] ]; $count++ }) (*FAIL) };
  say sprintf "RS: %s", _dump(\%RS);
  say sprintf "count: %s", $count;
'
RS: {
  "1" => [ "a", 0 ],
  "2" => [ "aa", 0 ],
  "3" => [ "aaa", 0 ],
  "4" => [ "aaaa", 0 ],
  "5" => [ "aaaab", 0 ]
}

count: 5

Upvotes: -1

ikegami

Reputation: 386331

General solution: Put the longest one first.

my ($longest) = /(aaa|aa|a)/;
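
When the alternatives are plain literal strings (as in a|aa|aaa), the reordering can be done programmatically. The following is only a minimal sketch under that assumption; the names @alternatives and $alternation are made up for illustration, and sorting by string length says nothing useful about alternatives that contain regexp metacharacters or quantifiers.

```perl
use strict;
use warnings;

# Assumption: every alternative is a literal string, not a sub-pattern.
my @alternatives = ('a', 'aa', 'aaa');

# Longest literal first, so that first-match coincides with longest-match.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @alternatives;

my ($longest) = "aaaa" =~ /($alternation)/;

print "$alternation\n";   # aaa|aa|a
print "$longest\n";       # aaa
```

quotemeta is there so that literals containing characters such as . or + are matched verbatim.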

Specific solution: Use

my ($longest) = /([0-9]{4}|[0-9]{1,2})/;

If you can't edit the pattern, you'll have to find every possibility and find the longest of them.

my $longest = '';   # start with an empty string so length() never sees undef
while (/([0-9]|[0-9][0-9]|[0-9][0-9][0-9][0-9])/g) {
   $longest = $1 if length($1) > length($longest);
}

Upvotes: 4

Borodin

Reputation: 126742

The alternation will use the first alternative that matches, so just write /aaa|aa|a/ instead.

For the example you have shown in your question, just put the longest alternative first like I said:

[0-9][0-9][0-9][0-9]|[0-9][0-9]|[0-9]
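
To see the difference on the string from the question, here is a small sketch that runs both orderings with /g in list context (the variable names are just for illustration):

```perl
use strict;
use warnings;

my $date = "2.10.2014";

# Original order: the one-digit alternative always succeeds first.
my @first_wins = $date =~ /[0-9]|[0-9][0-9]|[0-9][0-9][0-9][0-9]/g;

# Longest alternative first: the four-digit and two-digit forms are tried
# before the engine falls back to a single digit.
my @longest_wins = $date =~ /[0-9][0-9][0-9][0-9]|[0-9][0-9]|[0-9]/g;

print "@first_wins\n";     # 2 1 0 2 0 1 4
print "@longest_wins\n";   # 2 10 2014
```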

Upvotes: 2

amon

Reputation: 57640

The sanest solution I can see for unknown patterns is to match every possible pattern, look at the length of the matched substrings and select the longest substring:

use feature 'say';

my @patterns = (qr/a/, qr/a(a)/, qr/b/, qr/aaa/);
my $string = "aaa";

# Keep the matched substring for each pattern that matches; skip the rest.
my @substrings = map { $string =~ /($_)/ ? $1 : () } @patterns;

say "Matched these substrings:";
say for @substrings;

my $longest_token = (sort { length $b <=> length $a } @substrings)[0];

say "Longest token was: $longest_token";

Output:

Matched these substrings:
a
aa
aaa
Longest token was: aaa
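
The same idea can be extended to tokenize a whole string, such as the 2.10.2014 example from the question. What follows is only a sketch, not a tuned implementation: longest_tokens is a hypothetical helper, it assumes the alternatives are available as separate patterns, and it simply skips any character at which no pattern matches.

```perl
use strict;
use warnings;

# Hypothetical helper: at each position, try every pattern anchored with \G
# and keep the longest match, then continue after that match.
sub longest_tokens {
    my ($string, @patterns) = @_;
    my @tokens;
    my $pos = 0;
    while ($pos < length $string) {
        my $best = '';
        for my $re (@patterns) {
            pos($string) = $pos;    # every pattern starts at the same offset
            if ($string =~ /\G($re)/gc && length($1) > length($best)) {
                $best = $1;
            }
        }
        if (length $best) {
            push @tokens, $best;
            $pos += length $best;
        }
        else {
            $pos++;                 # nothing matched here; skip one character
        }
    }
    return @tokens;
}

my @tokens = longest_tokens(
    "2.10.2014",
    qr/[0-9]/, qr/[0-9][0-9]/, qr/[0-9][0-9][0-9][0-9]/,
);
print "@tokens\n";   # 2 10 2014
```

Each pattern is run once per position, so the cost grows with the number of patterns; for a handful of short patterns that is unlikely to matter.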

For known patterns, one would sort them manually so that first-match is the same as longest-match:

"aaa" =~ /(aaa|aa|b|a)/;
say "I know that this was the longest substring: $1";

Upvotes: 2
