Reputation: 959
I wrote a regex in a Perl script to find and capture a word that contains the sequence "fp", "fd", "sp" or "sd" in a sentence. However, the word may contain some non-word characters like θ or ð. The word may be at the beginning or end of the sentence. When I tested this regex on regex101.com, it matched even when the input was empty. The way I interpret this regex is: match one of the patterns "fp", "fd", "sp" or "sd" and capture everything around it until either a whitespace or the beginning of the line on the left side and a whitespace or the end of the line on the right side.
This is the regex: ^|\s(.*[fs][ˈ|ˌ]?[pd].*)\s|$
I also tried using the ? quantifier to make the .* pattern lazy, but it still shows a match when the input is empty.
Here are some examples of what I need it to capture in parentheses:
(fpgθ) tig <br/>
tig (gfpθ) tig<br/>
tig (gθfp)<br/>
Edit: I forgot to explain the middle part. The [ˈˌ]? part (I made a mistake, I don't need the |) just allows for those characters to be between the [fs] and [pd]. I wouldn't want it to match things like tigf pg. I want it to match any word (defined by the space around it, so in a sentence like tig you rθð the words are tig, you, and rθð). This "word" could be at the end, at the beginning, or in the middle of the sentence. Is there a way to assert the position at the beginning of the string within a bracket? I think that would solve my problem.
Also, I tried using \w, but because I have things like θ or ð it doesn't match those.
Upvotes: 1
Views: 869
Reputation: 18950
find and capture a word that contains the sequence "fp", "fd", "sp" or "sd" in a sentence. However, the word may contain some non-word characters like θ or ð.
You should match Unicode letters \p{L} instead of regular word characters \w. I have simplified the pattern according to your latest edits:
use warnings;
use strict;
use utf8;
use open ":std", ":encoding(UTF-8)";
my $regex = qr/\p{L}*[fs][pd]\p{L}*/;   # a run of letters containing fp, fd, sp or sd
my $str = 'fpgθ tig <br/>
tig gfpθ tig<br/>
tig gθfp<br/>
fptig gfpθ tig<br/>
sddgsdθ(θ@) tig gθfp<br/>';
my @m = $str =~ /$regex/g;              # list context with /g returns every match
print "@m\n" if @m;                     # matches contain no spaces, so join on a space
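The question's edit also asks for an optional stress mark (ˈ or ˌ) between the [fs] and the [pd]. A minimal sketch of how the pattern above could be extended for that, assuming those two marks are the only characters allowed in that position; the helper name find_words is made up for the illustration:

```perl
use strict;
use warnings;
use utf8;                                  # the pattern itself contains non-ASCII literals
use open ":std", ":encoding(UTF-8)";

# Assumption: only ˈ or ˌ may appear between [fs] and [pd].
my $word_re = qr/\p{L}*[fs][ˈˌ]?[pd]\p{L}*/;

# Hypothetical helper: returns every matching word in the given text.
sub find_words {
    my ($text) = @_;
    my @found = $text =~ /$word_re/g;      # list context with /g returns all matches
    return @found;
}

print join(", ", find_words("tig gθfˈp tig sdθ")), "\n";
```

Because \p{L}* cannot cross a space, an input such as tigf pg yields no match.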
Upvotes: 1
Reputation: 66873
The description is still a little open-ended, but this works with the data shown:
use warnings;
use strict;
use feature 'say';
use utf8;
use open ":std", ":encoding(UTF-8)";
my @strs = (
'(fpgθ) tig <br/>',
'tig (gfpθ) tig<br/>',
'tig (gθfp)<br/>',
);
for (@strs)
{
my @m = /\b( \S*? [fs][pd] \S*? )\b/gx;
say "@m" if @m; # no space allowed by the pattern
}
Depending on clarifications you may want to tweak the \S and \b that are used. I capture into an array, with /g, for strings with more than one match. I left the parentheses in for an additional test.
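As for why the original pattern matched empty input: alternation has the lowest precedence, so ^|\s(...)\s|$ reads as three separate alternatives (^ alone, \s(...)\s, or $ alone), and ^ by itself succeeds on any string. One possible tweak, if a "word" is strictly anything delimited by whitespace, is to replace the anchors with lookarounds. This is only a sketch under that assumption, and find_tokens is a made-up helper name:

```perl
use strict;
use warnings;
use utf8;
use open ":std", ":encoding(UTF-8)";

# (?<!\S) : not preceded by a non-space (start of string, or right after whitespace)
# (?!\S)  : not followed by a non-space (end of string, or right before whitespace)
my $token_re = qr/(?<!\S)(\S*[fs][ˈˌ]?[pd]\S*)(?!\S)/;

# Hypothetical helper: returns every whitespace-delimited token that matches.
sub find_tokens {
    my ($text) = @_;
    my @found = $text =~ /$token_re/g;
    return @found;
}

for my $s ('fpgθ tig', 'tig gfpθ tig', 'tig gθfp', 'tigf pg', '') {
    my @m = find_tokens($s);
    print "'$s' => ", (@m ? "@m" : "(no match)"), "\n";
}
```

Note that with \S any punctuation attached to the word is captured too, so this is stricter about spaces but looser about content than the \b version.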
The use utf8 pragma allows UTF-8 in the source, so here it is needed only for the literals in the @strs array.
The use open pragma, however, is essential, as it sets the default (PerlIO) input and output layers, in this case UTF-8 for the standard streams (and for files opened in its scope). So you can read from a file and print to a file or console.
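To illustrate why the decoding matters, here is a minimal sketch that decodes by hand with the core Encode module instead of relying on use open. The byte string and the helper first_word are invented for the example; it assumes the default regex semantics (no unicode_strings feature, no locale):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for "gθfp", as they would arrive from an undecoded filehandle.
my $bytes = "g\xCE\xB8fp";
# The same text decoded into Perl characters (θ becomes one character).
my $chars = decode('UTF-8', $bytes);

# Hypothetical helper: first \w-based word containing fp, fd, sp or sd.
sub first_word {
    my ($s) = @_;
    return $s =~ /(\w*[fs][pd]\w*)/ ? $1 : undef;
}

# On raw bytes the two bytes of θ are not word characters, so the word is cut short.
print "bytes: matched length ", length(first_word($bytes)), "\n";
# On decoded text θ counts as a word character and the whole word is captured.
print "chars: matched length ", length(first_word($chars)), "\n";
```

This suggests that a \w that "doesn't match θ" can be a symptom of undecoded input rather than of \w itself.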
Upvotes: 1