Reputation: 1521
I have the following code:
//Array filled with data from external file
$patterns = array('!test!', 'stuff1', 'all!!', '');
//Delete empty values in array
$patterns = array_filter($patterns);
foreach($patterns as &$item){
$item = preg_quote($item);
}
$pattern = '/(\b|^|- |--|-)(?:'.implode('|', $patterns).')(-|--| -|\b|$)/i';
$clid = "I am the !test! stuff1 all!! string";
echo $clid;
$clid = trim(preg_replace($pattern, ' ', $clid));
echo $clid;
Output:
//I am the !test! stuff1 all!! string
//I am the !test! all!! string
I'm escaping the ! with preg_quote(), so why are !test! and all!! not replaced?
I had a second problem, which is now solved, but I don't know why it happened.
Suppose $clid = "I am Jörg Müller with special chars". If I remove the line $patterns = array_filter($patterns); then the output after preg_replace() is I am J. I cannot work out why, but array_filter() solved the problem.
Upvotes: 0
Views: 174
Reputation: 6511
The problem is that you're using \b to assert word boundaries. However, the character "!" is not a word character, so \b doesn't match between " " and "!".
These are the word boundaries in $clid:
 I   a m   t h e   ! t e s t !   s t u f f 1   a l l ! !   s t r i n g
^ ^ ^   ^ ^     ^   ^       ^   ^           ^ ^     ^     ^           ^
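A minimal sketch to check this, using two of the items from the question (the variable names $wordItem and $bangItem are only illustrative):

```php
<?php
// Subject string from the question.
$clid = "I am the !test! stuff1 all!! string";

// "stuff1" starts and ends with word characters, so \b matches on both sides.
$wordItem = preg_match('/\bstuff1\b/', $clid);

// "!test!" starts and ends with "!", a non-word character. Between a space
// and "!" there is no word boundary, so \b fails there.
$bangItem = preg_match('/\b' . preg_quote('!test!', '/') . '\b/', $clid);

var_dump($wordItem); // int(1)
var_dump($bangItem); // int(0)
```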
You could check explicitly what surrounds each item instead:
Before the item, (?:-[- ]?| +) matches "-", "--", "- ", or one or more spaces.
After the item, (?:-[- ]?|(?= )|$) matches "-", "--", "- ", or asserts with a lookahead that it is followed by a space or the end of the string.
Regex
$pattern = '/(?:-[- ]?| +)(?:'.implode('|', $patterns).')(?:-[- ]?|(?= )|$)/i';
Code
//Array filled with data from external file
$patterns = array('!test!', 'stuff1', 'all!!', '');
//Delete empty values in array
$patterns = array_filter($patterns);
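foreach($patterns as &$item){
    //Pass the delimiter so that a "/" inside an item is escaped too
    $item = preg_quote($item, '/');
}
//Break the reference left behind by the foreach
unset($item);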
$pattern = '/(?:-[- ]?| +)(?:'.implode('|', $patterns).')(?:-[- ]?|(?= )|$)/i';
$clid = "I am the !test! stuff1 all!! string and !test!! not matched";
$clid = trim(preg_replace($pattern, '', $clid));
echo $clid;
Output
I am the string and !test!! not matched
As for your second question, you have an empty item in your array. So the regex would turn out to be:
(?:option1|option2|option3|)
                          ^
Notice there's a 4th option there: an empty subpattern, and an empty subpattern always matches. Your regex could then be read as:
/(\b|^|- |--|-)(-|--| -|\b|$)/i
which is why you had unexpected results. array_filter() solved your problem by removing the empty items.
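A small sketch of that effect, assuming the question's original pattern and a made-up subject string 'ab cd' that contains none of the items:

```php
<?php
// Items as read from the file, including the empty entry.
$items = array('!test!', 'stuff1', 'all!!', '');
$quoted = array_map(function ($item) {
    return preg_quote($item, '/');
}, $items);

// Without array_filter() the alternation ends in "|)", an empty branch.
$pattern = '/(\b|^|- |--|-)(?:' . implode('|', $quoted) . ')(-|--| -|\b|$)/i';

// The empty branch lets the pattern match at word boundaries,
// so a string containing none of the items is still modified.
$result = preg_replace($pattern, ' ', 'ab cd');
var_dump($result !== 'ab cd'); // bool(true)
```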
Upvotes: 1
Reputation: 89564
Here is how I would do it:
$clid = "I am the !test! stuff1 all!! string";
$items = ['!test!', 'stuff1', 'all!!', ''];
$pattern = array_reduce($items, function ($c, $i) {
return empty($i) ? $c : $c . preg_quote($i, '~') . '|';
}, '~[- ]+(?:');
$pattern .= '(*F))(?=[- ])~u';
$result = preg_replace($pattern, '', ' ' . $clid . ' ');
$result = trim($result, "- \t\n\r\0\x0b");
The idea is to check for a space or a hyphen after the "word" with a lookahead. This way the "separator" is not consumed, and the pattern can deal with consecutive matches.
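To illustrate why the lookahead matters, here is a simplified sketch with two made-up items, foo and bar, comparing a trailing separator that is consumed with one that is only asserted:

```php
<?php
// Hypothetical subject with two consecutive items to remove.
$subject = ' foo bar baz ';

// The trailing separator is consumed: after " foo " is removed, "bar" has
// no separator left in front of it and is skipped.
$consumed = preg_replace('~[- ]+(?:foo|bar)[- ]~', '', $subject);

// The trailing separator is only asserted: the space after "foo" is still
// available as the leading separator of "bar".
$asserted = preg_replace('~[- ]+(?:foo|bar)(?=[- ])~', '', $subject);

var_dump($consumed); // string(8) "bar baz "
var_dump($asserted); // string(5) " baz "
```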
To avoid an alternation at the beginning of the pattern (like (?:[- ]|^)[- ]*, which is slow), I add a space at the beginning of the source string; it is removed after the replacement with trim.
The (*F) (which forces the pattern to fail) is only there because the alternation of items is built with array_reduce, which leaves a trailing | at the end.
The problem with characters out of the ASCII range is solved with the u modifier. With this modifier the regex engine is able to deal with UTF-8 encoded strings.
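A minimal sketch of the difference, counting what the dot matches in "Jörg" with and without the modifier (the variable names are illustrative):

```php
<?php
// "Jörg" written with explicit UTF-8 bytes so the file encoding does not matter.
$name = "J\xC3\xB6rg";

// Without u, the engine works byte by byte: "ö" counts as two units.
$bytes = preg_match_all('/./', $name);

// With u, the subject is read as UTF-8: "ö" is one character.
$chars = preg_match_all('/./u', $name);

var_dump($bytes); // int(5)
var_dump($chars); // int(4)
```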
Upvotes: 1