Reputation: 321
I am learning Perl. I want to find all occurences of 3 keywords in this order: keyword1
, keyword2
and keyword3
in a text. keyword1
and keyword3
are optional. It can have up to 6 words between keywords. This is the code in Perl:
#!/usr/bin/perl
$reg="(keyword1)*\W*(?:\w+\W+){0,6}?(keyword2)\W*(?:\w+\W+){0,6}?(keyword3)*";
$content="some words before keyword1 optional word here then keyword2 again optional words then keyword3 others words after.";
while ($content=~m/$reg/g) {
print "$&\n";
}
I want to extract only the substring keyword1 optional word here then keyword2 again optional words then keyword3
but I got keyword2
. Thank you.
Upvotes: 0
Views: 136
Reputation: 385590
First of all, "\w"
produces the string w
, and "\W"
produces the string W
.
$ perl -wE'say "\w\W"'
Unrecognized escape \w passed through at -e line 1.
Unrecognized escape \W passed through at -e line 1.
wW
You need to escape the backslash ("\\W"
) or use qr//
(qr/\W/
).
I'm pretty sure there are other problems with the pattern. I'm going to start from scratch.
Assumes k1
and k3
are both independently optional, you want:
qr/
(?: \b k1 \W+
(?: \w+ \W+ ){0,6}?
)?
\b k2 \b
(?:
(?: \W+ \w+ ){0,6}?
\W+ k3 \b
)?
/x
The word boundaries (\b
) are there to ensure that we don't match abck2def
or abck1 k2 k3def
.
The above is inefficient.
Take for example the following regex pattern:
(?: x y )? x z
It can match the following strings:
xyxz
xz
Notice how both start with x
? That means a better pattern (i.e. one that performs less backtracking) would be
x (?: y x )? z
There are a couple of instances of this anti-pattern in the above answer. So let's refactor.
qr/
\b
(?: k1 \W+ (?: \w+ \W+ ){0,6}? \b )?
k2 \b
(?: \W+ (?: \w+ \W+ ){0,6}? k3 \b )?
/x
Now we have something efficient.
In the above pattern, notice that the second \b
is redundant. So let's get rid of it.
If we add a \b
to the very end, the third and fourth \b
become redundant.
After applying those simplifications, we get:
qr/
\b
(?: k1 \W+ (?: \w+ \W+ ){0,6}? )?
k2
(?: \W+ (?: \w+ \W+ ){0,6}? k3 )?
\b
/x
Personally, I strongly dislike the non-greediness modifier as anything but a optimization. Furthermore, the use of two of them is normally a giant red flag that there is a bug in the pattern. For example, the pattern can match k1 k1 k2
, but that may not be desirable.
To eliminate them, we need to ensure the first \w+
doesn't match k1
or k2
. This can be achieved by replacing
\b \w+ \b
with
(?! \b k1 \b ) (?! \b k2 \b ) \b \w+ \b
Again, we factor out common prefixes to get:
\b (?! (?: k2 | k3 ) \b ) \w+ \b
Similarly, we need to ensure that the second \w+
doesn't match k2
or k3
.
With these changes, we get:
qr/
\b
(?: k1 \W+ (?: (?! (?: k1 | k2 ) \b ) \w+ \W+ ){0,6} )?
k2
(?: \W+ (?: (?! (?: k2 | k3 ) \b ) \w+ \W+ ){0,6} k3 )?
\b
/x
Complicated? yes. A better solution might start by breaking down the stream into word and non-word tokens. The advantage of this is that we don't have to worry about boundaries anymore.
my @tokens = split(/(\W+)/, $content, -1);
Then, the array is checked for the pattern. Since the regex engine is particular adept at doing this, we can leverage it as follows:
my $tokens =
join '',
map {
($_ % 2) ? "W"
: $words[$_] eq "k1" ? 1
: $words[$_] eq "k2" ? 2
: $words[$_] eq "k3" ? 3
: "w" # Non-key word
}
0..$#tokens;
while ($tokens =~ /(?: 1 W (?: w W ){0,6} )? 2 (?: W (?: w W ){0,6} 3 )?/xg) {
say join('', @tokens[ $-[0] .. $+[0] - 1 ]);
}
Given the that @tokens
will always be of the form word, non-word, word, non-word, etc, we can also use the following:
my $words =
join '',
map {
($_ % 2) ? "" # We just want to look at the words
: $words[$_] eq "k1" ? 1
: $words[$_] eq "k2" ? 2
: $words[$_] eq "k3" ? 3
: "w" # Non-key word
}
0..$#tokens;
while ($words =~ /(?: 1 w{0,6} )? 2 (?: w{0,6} 3 )?/xg) {
say join('', @tokens[ $-[0] * 2 .. ( $+[0] - 1 ) * 2 ]);
}
Upvotes: 2