Lisa

Reputation: 959

Regex to match and capture a word

I wrote a regex in a Perl script to find and capture a word that contains the sequence "fp", "fd", "sp" or "sd" in a sentence. However, the word may contain some non-word characters like θ or ð. The word may be at the beginning or end of the sentence. When I tested this regex on regex101.com, it matched even when the input was empty. The way I interpret this regex is: match one of the patterns "fp", "fd", "sp" or "sd" and capture everything around it, up to either whitespace or the beginning of the line on the left side and whitespace or the end of the line on the right side.

This is the regex: ^|\s(.*[fs][ˈ|ˌ]?[pd].*)\s|$

I also tried using the ? quantifier to make the .* pattern lazy, but it still shows a match when the input is nothing.

Here are some examples of what I need it to capture in parentheses:

(fpgθ) tig <br/>
tig (gfpθ) tig<br/>
tig (gθfp)<br/>

Edit: I forgot to explain the middle part. The [ˈˌ]? part (I made a mistake, I don't need the |) just allows for those characters to be between the [fs] and [pd]. I wouldn't want it to match things like tigf pg. I want it to match any word, where a word is defined by the space around it: in a sentence like tig you rθð, the words are tig, you, and rθð. This "word" could be at the end, at the beginning, or in the middle of the sentence. Is there a way to assert the position at the beginning of the string within a bracket? I think that would solve my problem.

Also, I tried using \w, but because I have things like θ or ð it doesn't match those.

Upvotes: 1

Views: 869

Answers (2)

wp78de

Reputation: 18950

find and capture a word that contains the sequence "fp", "fd", "sp" or "sd" in a sentence. However, the word may contain some non-word characters like θ or ð.

You should match Unicode letters \p{L} instead of regular word characters \w:

\p{L}*[fs][pd]\p{L}*

I have simplified the pattern according to your latest edits.

use warnings;
use strict;

use utf8;
use open ":std", ":encoding(UTF-8)";

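# A compiled regex is a scalar, so store it in $regex rather than an array.
my $regex = qr/\p{L}*[fs][pd]\p{L}*/;
my @strs = (
    'fpgθ tig <br/>',
    'tig gfpθ tig<br/>',
    'tig gθfp<br/>',
    'fptig gfpθ tig<br/>',
    'sddgsdθ(θ@) tig gθfp<br/>',
);

for (@strs)
{
    my @m = /$regex/g;        # all matching words in this string
    print "@m\n" if @m;       # one line of matches per input string
}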
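
As for why the pattern in the question matches empty input: alternation has the lowest precedence in a regex, so ^|\s(...)\s|$ is read as three separate alternatives (a bare ^, the middle \s(...)\s, and a bare $), and the bare ^ succeeds on any input, including an empty one. Below is a minimal sketch of the difference. The grouped pattern is only an illustration, not the pattern used above, and it assumes a word is a run of non-whitespace delimited by whitespace or the string boundaries.

```perl
use strict;
use warnings;
use utf8;

# The pattern from the question, verbatim: the top-level | splits it into
# three alternatives, and the first one (a bare ^) matches any string.
my $original = qr/^|\s(.*[fs][ˈ|ˌ]?[pd].*)\s|$/;

# Illustrative alternative (an assumption, not the asker's final pattern):
# put the anchors inside non-capturing groups so they only delimit the word.
my $grouped = qr/(?:^|\s)(\S*[fs][ˈˌ]?[pd]\S*)(?:\s|\z)/;

my $original_matches_empty = ('' =~ $original) ? 1 : 0;
my $grouped_matches_empty  = ('' =~ $grouped)  ? 1 : 0;

# Capture the word from a sample sentence with the grouped pattern.
my ($word) = 'tig gθfp' =~ $grouped;

print "original matches empty input: $original_matches_empty\n";
print "grouped matches empty input:  $grouped_matches_empty\n";
```

Putting ^ inside (?:^|\s) is one way to "assert the beginning of the string within a bracket", as the question asks.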

Upvotes: 1

zdim

Reputation: 66873

The description still leaves a few things open, but this works with the data shown:

use warnings;
use strict;
use feature 'say';

use utf8;
use open ":std", ":encoding(UTF-8)";

my @strs = ( 
    '(fpgθ) tig <br/>',
    'tig (gfpθ) tig<br/>',
    'tig (gθfp)<br/>',
);

for (@strs) 
{
    my @m = /\b( \S*? [fs][pd] \S*? )\b/gx; 

    say "@m" if @m;   # no space allowed by the pattern
}

Depending on clarifications you may want to tweak the \S and \b that are used. I capture into an array, with /g, for strings with more than one match. I left parentheses in for an additional test.
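
One possible tweak, sketched under the assumption that a "word" is any run of non-whitespace (so surrounding punctuation such as parentheses would be captured too): replace \b with the lookarounds (?<!\S) and (?!\S), and allow the optional stress marks from the question between the two consonants. The helper name words_with_cluster is hypothetical.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: return every whitespace-delimited word that contains
# f or s, an optional stress mark (ˈ or ˌ), and then p or d.
sub words_with_cluster {
    my ($sentence) = @_;
    # (?<!\S) = not preceded by a non-space; (?!\S) = not followed by one.
    # Neither consumes the space, so adjacent words can both match with /g.
    return $sentence =~ /(?<!\S)(\S*[fs][ˈˌ]?[pd]\S*)(?!\S)/g;
}

my @found = words_with_cluster('fpgθ tig sˈdθ');
my @none  = words_with_cluster('tigf pg tig');

binmode(STDOUT, ':encoding(UTF-8)');
print scalar(@found), " word(s) found: @found\n";
```

Because the lookarounds also succeed at the start and end of the string, no separate ^ or $ alternative is needed.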

The use utf8 pragma tells Perl that the source file itself is UTF-8, so here it matters only for the literals in my @strs array.

The use open pragma, however, is essential, as it sets the default (PerlIO) input and output layers, in this case UTF-8 for files you open and for the standard streams. So you can read from a file and print to a file or console.
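
This may also be why \w seemed not to match θ in the question. A minimal sketch, assuming the input arrived as undecoded UTF-8 bytes: under Perl's default rules, bytes above 127 in an undecoded string do not count as word characters, while the decoded character does. Encode's decode stands in here for what an :encoding(UTF-8) layer does on input.

```perl
use strict;
use warnings;
use Encode qw(decode);

# θ (U+03B8) as the two raw bytes of its UTF-8 encoding, i.e. what a
# filehandle without an encoding layer would hand to the program.
my $bytes = "\xCE\xB8";

# The same data decoded into a single character.
my $chars = decode('UTF-8', $bytes);

my $raw_is_word     = ($bytes =~ /\A\w+\z/) ? 1 : 0;
my $decoded_is_word = ($chars =~ /\A\w+\z/) ? 1 : 0;

print "raw bytes match \\w:     $raw_is_word\n";
print "decoded char matches \\w: $decoded_is_word\n";
```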

Upvotes: 1
