Implement word boundaries with regex alternations and words that might not begin/end with a word character

I have the following code:

//Array filled with data from external file
$patterns = array('!test!', 'stuff1', 'all!!', '');

//Delete empty values in array
$patterns = array_filter($patterns);

foreach($patterns as &$item){
       $item = preg_quote($item);
}

$pattern = '/(\b|^|- |--|-)(?:'.implode('|', $patterns).')(-|--| -|\b|$)/i';

$clid = "I am the !test! stuff1 all!! string";

echo $clid;
$clid = trim(preg_replace($pattern, ' ', $clid));
echo $clid;

Output:

//I am the !test! stuff1 all!! string
//I am the !test! all!! string

I'm escaping the ! with preg_quote(), so why are !test! and all!! not removed?

I had a second problem, which is now solved, but I don't know why it happened. Suppose $clid = "I am Jörg Müller with special chars". If I removed the line $patterns = array_filter($patterns); the output after preg_replace() was "I am J". I could not find out why, but array_filter() solved the problem.

Upvotes: 0

Answers (2)

Mariano

Reputation: 6511

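The problem is that you're using \b to assert word boundaries. A word boundary only exists between a word character (a letter, digit, or underscore) and a non-word character. "!" is not a word character, so \b does not match between a space and "!", as in " !test".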

These are the word boundaries in $clid:

 I   a m   t h e   ! t e s t !   s t u f f 1   a l l ! !   s t r i n g
^ ^ ^   ^ ^     ^   ^       ^   ^           ^ ^     ^     ^           ^
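
A quick way to check this, as a minimal sketch (the variable names $wordItem and $bangItem are only illustrative):

```php
<?php
// \b only matches between a word character ([A-Za-z0-9_]) and a non-word character.
$subject = 'I am the !test! stuff1 all!! string';

// "stuff1" starts and ends with word characters, so \b matches on both sides.
$wordItem = preg_match('/\bstuff1\b/', $subject);

// "!test!" starts with "!" and is preceded by a space: two non-word
// characters in a row, so there is no boundary and the match fails.
$bangItem = preg_match('/\b!test!\b/', $subject);

var_dump($wordItem, $bangItem); // int(1), int(0)
```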

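You could instead match the separators explicitly and use a lookahead for the trailing space, so that each item is:

Preceded by (?:-[- ]?| +), which matches "- ", "-", "--", or one or more spaces.
Followed by (?:-[- ]?|(?= )|$), which matches "- ", "-", "--", or asserts that the item is followed by a space or the end of the string.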

Regex

$pattern = '/(?:-[- ]?| +)(?:'.implode('|', $patterns).')(?:-[- ]?|(?= )|$)/i';

Code

//Array filled with data from external file
$patterns = array('!test!', 'stuff1', 'all!!', '');

//Delete empty values in array
$patterns = array_filter($patterns);

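foreach($patterns as &$item){
       // Pass the delimiter so a "/" inside an item is escaped too
       $item = preg_quote($item, '/');
}
unset($item); // break the reference left over from the foreach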

$pattern = '/(?:-[- ]?| +)(?:'.implode('|', $patterns).')(?:-[- ]?|(?= )|$)/i';


$clid = "I am the !test! stuff1 all!! string and !test!! not matched";
$clid = trim(preg_replace($pattern, '', $clid));

echo $clid;

Output

I am the string and !test!! not matched

As for your second question, you have an empty item in your array, so the alternation ends up as:

(?:option1|option2|option3|)
                           ^

Notice there's a fourth option there: an empty subpattern, which always matches. With that option, your regex can match as if it were:

/(\b|^|- |--|-)(-|--| -|\b|$)/i

which is why you got unexpected results.

array_filter() solved your problem by removing empty items.
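
The effect of an empty alternative can be reproduced in isolation. This is a minimal sketch with a made-up item list ('foo' and an empty string), not the asker's data:

```php
<?php
// An empty alternative lets the group match the empty string at every position.
$withEmpty = '/(?:foo|)/';
$broken = preg_replace($withEmpty, '-', 'ab');
echo $broken, "\n"; // -a-b-

// Dropping empty strings before building the alternation avoids this.
$items = array_filter(['foo', ''], function ($s) { return $s !== ''; });
$clean = '/(?:' . implode('|', array_map('preg_quote', $items)) . ')/';
$fixed = preg_replace($clean, '-', 'ab');
echo $fixed, "\n"; // ab
```

The explicit callback is a design choice for this sketch: array_filter() without a callback also drops the string "0", which may or may not be wanted.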

Upvotes: 1

Casimir et Hippolyte

Reputation: 89564

Here is how I would do it:

$clid = "I am the !test! stuff1 all!! string";

$items = ['!test!', 'stuff1', 'all!!', ''];

$pattern = array_reduce($items, function ($c, $i) {
    // Compare with '' rather than empty(), which would also skip the item "0"
    return $i === '' ? $c : $c . preg_quote($i, '~') . '|';
}, '~[- ]+(?:');

$pattern .= '(*F))(?=[- ])~u';

$result = preg_replace($pattern, '', ' ' . $clid . ' ');
$result = trim($result, "- \t\n\r\0\x0b");

The idea is to check for a space or a hyphen after the "word" with a lookahead. This way the "separator" is not consumed, so it is still available as the leading separator of the next item and the pattern can handle consecutive matches.
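
The difference is easy to see by comparing a consuming separator with the lookahead. This is a small sketch with a shortened, hypothetical item list (stuff1 and all) and a made-up subject:

```php
<?php
$subject = ' x stuff1 all y ';

// Trailing separator consumed: the space after "stuff1" is part of the first
// match, so "all" no longer has a leading separator and is left in place.
$consumed = preg_replace('~[- ]+(?:stuff1|all)[- ]~', '', $subject);

// Trailing separator only asserted: the same space can start the next match.
$lookahead = preg_replace('~[- ]+(?:stuff1|all)(?=[- ])~', '', $subject);

var_dump($consumed);  // string(8) " xall y "
var_dump($lookahead); // string(4) " x y "
```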

To avoid an alternation at the beginning of the pattern (like (?:[- ]|^)[- ]*, which is slow), I add a space at the beginning of the source string. I also add one at the end so that the lookahead can succeed after the last word. Both are removed after the replacement with trim.

The (*F) (which forces the pattern to fail) is only there because the alternation of items is built with array_reduce, which leaves a trailing | at the end.
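
To make the role of (*F) concrete, this sketch rebuilds the pattern from the code above and prints it. The only assumption is that empty items are skipped with a strict comparison against '':

```php
<?php
$items = ['!test!', 'stuff1', 'all!!', ''];

// Each non-empty item is quoted and followed by "|", so the alternation
// ends with a dangling "|".
$pattern = array_reduce($items, function ($c, $i) {
    return $i === '' ? $c : $c . preg_quote($i, '~') . '|';
}, '~[- ]+(?:');

// (*F) fills the slot after the last "|" with a branch that can never match.
$pattern .= '(*F))(?=[- ])~u';

echo $pattern, "\n";
// ~[- ]+(?:\!test\!|stuff1|all\!\!|(*F))(?=[- ])~u

$result = trim(preg_replace($pattern, '', ' I am the !test! stuff1 all!! string '));
echo $result, "\n"; // I am the string
```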

The problem with characters outside the ASCII range is solved with the u modifier. With this modifier the regex engine treats the pattern and the subject as UTF-8 encoded strings instead of sequences of single bytes.
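
As a rough illustration of what the u modifier changes, the sketch below counts how many times "." matches in "Jörg" with and without it. The "\u{...}" escape assumes PHP 7 or later:

```php
<?php
// "Jörg" written with an escape so the example does not depend on the
// encoding of the source file.
$name = "J\u{00F6}rg";

// Without u the engine works on bytes: "ö" is two bytes in UTF-8,
// so "." matches 5 times.
$bytes = preg_match_all('/./', $name);

// With u the subject is read as UTF-8 and "ö" is a single character.
$chars = preg_match_all('/./u', $name);

var_dump($bytes, $chars); // int(5), int(4)
```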

Upvotes: 1
