jonderry
jonderry

Reputation: 23633

Find all possible starting positions of a regular expression match in perl, including overlapping matches?

Is there a way to find all possible start positions for a regex match in perl?

For example, if your regex was "aa" and the text was "aaaa", it would return 0, 1, and 2, instead of, say 0 and 2.

Obviously, you could just do something like return the first match, and then delete all characters up to and including that starting character, and perform another search, but I'm hoping for something more efficient.

Upvotes: 4

Views: 1742

Answers (4)

Eric Strom
Eric Strom

Reputation: 40142

Update:

I thought about this one a bit more, and came up with this solution using an embedded code block, which is nearly three times faster than the grep solution:

use 5.010;
use warnings;
use strict;

{my @pos;
 my $push_pos = qr/(?{push @pos, $-[0]})/;

sub with_code {
    my ($re, $str) = @_;
    @pos = ();
    $str =~ /(?:$re)$push_pos(?!)/;
    @pos
}}

and for comparison:

sub with_grep {  # old solution
    my ($re, $str) = @_;
    grep {pos($str) = $_; $str =~ /\G(?:$re)/} 0 .. length($str) - 1;
}

sub with_while { # per Michael Carman's solution, corrected
    my ($re, $str) = @_;
    my @pos;
    while ($str =~ /\G.*?($re)/) {
        push @pos, $-[1];
        pos $str = $-[1] + 1
    }
    @pos
}

sub with_look_ahead {  # a fragile "generic" version of Sean's solution
    my ($re, $str) = @_;
    my ($re_a, $re_b) = split //, $re, 2;
    my @pos;
    push @pos, $-[0] while $str =~ /$re_a(?=$re_b)/g;
    @pos
}

Benchmarked and sanity checked with:

use Benchmark 'cmpthese';

my @arg = qw(aa aaaabbbbbbbaaabbbbbaaa);
my $expect = 7;

for my $sub qw(grep while code look_ahead) {
    no strict 'refs';
    my @got = &{"with_$sub"}(@arg);
    "@got" eq '0 1 2 11 12 19 20' or die "$sub: @got";
}

cmpthese -2 => {
    grep  => sub {with_grep      (@arg) == $expect or die},
    while => sub {with_while     (@arg) == $expect or die},
    code  => sub {with_code      (@arg) == $expect or die},
    ahead => sub {with_look_ahead(@arg) == $expect or die},
};

Which prints:

          Rate  grep while ahead  code
grep   49337/s    --  -20%  -43%  -65%
while  61293/s   24%    --  -29%  -56%
ahead  86340/s   75%   41%    --  -38%
code  139161/s  182%  127%   61%    --

Upvotes: 2

Joel Berger
Joel Berger

Reputation: 20280

I know you asked for a regex, but there is actually a simple builtin function that does something quite similar, the function index (perldoc -f index). From that we can build up a simple solution to your direct question, though if you really need a more complicated search than your example this will not work as it only looks for substrings (after an index given by the third parameter).

#!/usr/bin/env perl

use strict;
use warnings;

my $str = 'aaaa';
my $substr = 'aa';

my $pos = -1;
while (1) {
  $pos = index($str, $substr, $pos + 1);
  last if $pos < 0;
  print $pos . "\n";
}

Upvotes: 1

Eugene Yarmash
Eugene Yarmash

Reputation: 149736

You can use global matching with the pos() function:

my $s1 = "aaaa";
my $s2 = "aa";

while ($s1 =~ /aa/g) {
    print pos($s1) - length($s2), "\n";
}

Upvotes: 0

Sean
Sean

Reputation: 29772

Use lookahead:

$ perl -le 'print $-[0] while "aaaa" =~ /a(?=a)/g'

In general, put everything except the first character of the regex inside of the (?=...).

Upvotes: 6

Related Questions