Perl regex finding string containing keywords in order

Question

I am learning Perl. I want to find all occurences of 3 keywords in this order: keyword1, keyword2 and keyword3 in a text. keyword1 and keyword3 are optional. It can have up to 6 words between keywords. This is the code in Perl:

#!/usr/bin/perl
$reg="(keyword1)*\W*(?:\w+\W+){0,6}?(keyword2)\W*(?:\w+\W+){0,6}?(keyword3)*";
$content="some words before keyword1 optional word here then keyword2 again optional words then keyword3 others words after.";
while ($content=~m/$reg/g) {
    print "$&
";
}

I want to extract only the substring keyword1 optional word here then keyword2 again optional words then keyword3 but I got keyword2. Thank you.

ikegami · Accepted Answer

First of all, "\w" produces the string w, and "\W" produces the string W.

$ perl -wE'say "\w\W"'
Unrecognized escape \w passed through at -e line 1.
Unrecognized escape \W passed through at -e line 1.
wW

You need to escape the backslash ("\W") or use qr// (qr/\W/).

I'm pretty sure there are other problems with the pattern. I'm going to start from scratch.

Assumes k1 and k3 are both independently optional, you want:

qr/
    (?: \b k1 \W+
        (?: \w+ \W+ ){0,6}?
    )?

    \b k2 \b

    (?: 
        (?: \W+ \w+ ){0,6}?
        \W+ k3 \b
    )?
/x

The word boundaries (\b) are there to ensure that we don't match abck2def or abck1 k2 k3def.

The above is inefficient.

Take for example the following regex pattern:

(?: x y )? x z

It can match the following strings:

xyxz
xz

Notice how both start with x? That means a better pattern (i.e. one that performs less backtracking) would be

x (?: y x )? z

There are a couple of instances of this anti-pattern in the above answer. So let's refactor.

qr/
    \b
    (?: k1 \W+ (?: \w+ \W+ ){0,6}? \b )?
    k2 \b
    (?: \W+ (?: \w+ \W+ ){0,6}? k3 \b  )?
/x

Now we have something efficient.

In the above pattern, notice that the second \b is redundant. So let's get rid of it.

If we add a \b to the very end, the third and fourth \b become redundant.

After applying those simplifications, we get:

qr/
    \b
    (?: k1 \W+ (?: \w+ \W+ ){0,6}? )?
    k2
    (?: \W+ (?: \w+ \W+ ){0,6}? k3 )?
    \b
/x

Personally, I strongly dislike the non-greediness modifier as anything but a optimization. Furthermore, the use of two of them is normally a giant red flag that there is a bug in the pattern. For example, the pattern can match k1 k1 k2, but that may not be desirable.

To eliminate them, we need to ensure the first \w+ doesn't match k1 or k2. This can be achieved by replacing

\b \w+ \b

with

(?! \b k1 \b ) (?! \b k2 \b ) \b \w+ \b

Again, we factor out common prefixes to get:

\b (?! (?: k2 | k3 ) \b ) \w+ \b

Similarly, we need to ensure that the second \w+ doesn't match k2 or k3.

With these changes, we get:

qr/
    \b
    (?: k1 \W+ (?: (?! (?: k1 | k2 ) \b ) \w+ \W+ ){0,6} )?
    k2
    (?: \W+ (?: (?! (?: k2 | k3 ) \b ) \w+ \W+ ){0,6} k3 )?
    \b
/x

Complicated? yes. A better solution might start by breaking down the stream into word and non-word tokens. The advantage of this is that we don't have to worry about boundaries anymore.

my @tokens = split(/(\W+)/, $content, -1);

Then, the array is checked for the pattern. Since the regex engine is particular adept at doing this, we can leverage it as follows:

my $tokens =
   join '',
      map {
         ($_ % 2) ? "W"
         : $words[$_] eq "k1" ? 1
         : $words[$_] eq "k2" ? 2
         : $words[$_] eq "k3" ? 3
         : "w"                      # Non-key word
      }
         0..$#tokens;

while ($tokens =~ /(?: 1 W (?: w W ){0,6} )? 2 (?: W (?: w W ){0,6} 3 )?/xg) {
   say join('', @tokens[ $-[0] .. $+[0] - 1 ]);
}

Given the that @tokens will always be of the form word, non-word, word, non-word, etc, we can also use the following:

my $words =
   join '',
      map {
         ($_ % 2) ? ""              # We just want to look at the words
         : $words[$_] eq "k1" ? 1
         : $words[$_] eq "k2" ? 2
         : $words[$_] eq "k3" ? 3
         : "w"                      # Non-key word
      }
         0..$#tokens;

while ($words =~ /(?: 1 w{0,6} )? 2 (?: w{0,6} 3 )?/xg) {
   say join('', @tokens[ $-[0] * 2 .. ( $+[0] - 1 ) * 2 ]);
}

Perl regex finding string containing keywords in order

Answers (1)

Related Questions