Regex to match and capture a word

Question

I wrote a regex in a Perl script to find and capture a word that contains the sequence "fp", "fd", "sp" or "sd" in a sentence. However, the word may contain some non-word characters like θ or ð. The word may be at the beginning or end of the sentence. When I tested this regex on regex101.com, it matches even when the input is empty. The way I interpret this regex is: match one of the patterns "fp", "fd", "sp" or "sd" and capture everything around it, up to either a whitespace or the beginning of the line on the left side and a whitespace or the end of the line on the right side.

This is the regex: ^|\s(.*[fs][ˈ|ˌ]?[pd].*)\s|$

I also tried using the ? quantifier to make the .* pattern lazy, but it still shows a match when the input is nothing.

Here are some examples of what I need it to capture in parentheses:

(fpgθ) tig 

tig (gfpθ) tig

tig (gθfp)

Edit: I forgot to explain the middle part. The [ˈˌ]? part (I made a mistake, I don't need the |) just allows for those characters to be between the [fs] and [pd]. I wouldn't want it to match things like tigf pg. I want it to match any word (defined by the space around it, so in a sentence like tig you rθð the words it contains are tig, you, and rθð). This "word" could be at the end, at the beginning, or in the middle of the sentence. Is there a way to assert the position at the beginning of the string within a bracket? I think that would solve my problem.

Also, I tried using \w, but because I have things like θ or ð it doesn't match those.

zdim · Accepted Answer

There is still a little openness in the description, but this works with the data shown:

use warnings;
use strict;
use feature 'say';

use utf8;
use open ":std", ":encoding(UTF-8)";

my @strs = (
    '(fpgθ) tig',
    'tig (gfpθ) tig',
    'tig (gθfp)',
);

for (@strs) 
{
    my @m = /\b( \S*? [fs][pd] \S*? )\b/gx; 

    say "@m" if @m;   # no space allowed by the pattern
}

Depending on clarifications, you may want to tweak the \S and \b used here. I capture into an array, with /g, to handle strings with more than one match. I left the parentheses in the test strings as an additional test.
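As one possible tweak, here is a minimal sketch that treats a "word" strictly as a run of non-whitespace, as the question's edit describes, and allows the optional stress marks between [fs] and [pd]. It also shows why the question's pattern matches empty input: the | there separates three top-level alternatives, and a bare ^ or $ matches anywhere a string starts or ends. The name find_words and the extra sample sentences are made up for illustration; the lookarounds (?<!\S) and (?!\S) are my substitute for the \b in the code above, not something the question asked for by name.

```perl
use strict;
use warnings;
use utf8;                                # the patterns and samples contain UTF-8 literals
use open ':std', ':encoding(UTF-8)';     # encode output on the standard streams

# Pattern from the question. Its top level is three alternatives:
#   ^   |   \s(...)\s   |   $
# so the bare ^ succeeds on any input, including the empty string.
my $original = qr/^|\s(.*[fs][ˈ|ˌ]?[pd].*)\s|$/;

# Assumed alternative: a whitespace-delimited "word".
#   (?<!\S)  no non-space character before: start of string or after whitespace
#   (?!\S)   no non-space character after:  end of string or before whitespace
my $word = qr/(?<!\S)(\S*[fs][ˈˌ]?[pd]\S*)(?!\S)/;

# Hypothetical helper: returns every matching word in a sentence (list context).
sub find_words {
    my ($sentence) = @_;
    return $sentence =~ /$word/g;
}

print "original pattern matches empty input\n" if '' =~ $original;

for my $sentence ('fpgθ tig', 'tig gfpθ tig', 'tig gθfp', 'tigf pg', 'sˈpa tig θfd', '') {
    my @found = find_words($sentence);
    print "input: $sentence | found: @found\n";
}
```

With this pattern, tigf pg and the empty string produce no match, because [fs] and [pd] must sit inside the same run of non-whitespace. Unlike the \b version, surrounding punctuation such as parentheses would be captured as part of the word.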

The use utf8 pragma allows UTF-8 literals in the source code, so here it is needed only for the @strs array.

The use open pragma, however, is essential: it sets the default (PerlIO) input and output layers, and with :std it applies them to the standard streams as well, in this case for UTF-8. So you can read from a file and print to a file or console.
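To illustrate what the layer does, here is a small sketch that reads the same UTF-8 bytes twice, once raw and once through an :encoding(UTF-8) layer given explicitly on open. The in-memory filehandle stands in for a real file, and the helper name first_line_length and the sample text are made up for the example. With use open in effect, the decoding layer would be applied by default rather than spelled out.

```perl
use strict;
use warnings;
use utf8;                      # the sample literal below contains θ
use Encode qw(encode);

# Raw UTF-8 bytes, standing in for the contents of a file on disk.
my $bytes = encode('UTF-8', "tig gθfp\n");

# Hypothetical helper: read the first line through the given layer
# and return its length, as Perl sees the string.
sub first_line_length {
    my ($layer) = @_;
    open my $fh, "<$layer", \$bytes
        or die "cannot open in-memory handle: $!";
    my $line = <$fh>;
    close $fh;
    chomp $line;
    return length $line;
}

my $raw     = first_line_length(':raw');               # counts bytes
my $decoded = first_line_length(':encoding(UTF-8)');   # counts characters

print "raw: $raw, decoded: $decoded\n";
```

The raw read reports 9 and the decoded read reports 8, since θ occupies two bytes in UTF-8 but is a single character once decoded. A regex only sees θ as one character in the decoded case.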
