preg_replace UTF-8 charset issue with à

I'm trying to insert a special marker string '|||' after newlines, spaces and a few other characters, so that I can then split my text into an array. I was thinking of doing it like this:

$result = preg_replace("/<br>/", "<br>|||", preg_replace("/\s/", " |||", preg_replace("/\r/", "\r|||", preg_replace("/\n/", "\n|||", preg_replace("/’/", "’|||", preg_replace("/'/", "'|||", $text))))));
$result = preg_split("/[|||]+/", $result);

It works for every word except those containing the character à, which gets replaced by �. I'm sure the problem is in this code, because the original string $text displays à correctly.

Upvotes: 1

Views: 164

Answers (1)

Wiktor Stribiżew

Reputation: 627607

Since your pattern deals with a Unicode string, pass the /u modifier so that both the pattern and the subject are treated as UTF-8 rather than as raw bytes.
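To see why the modifier matters: in UTF-8, à is stored as the two bytes 0xC3 0xA0, and without /u the regex engine works byte by byte. On some PCRE builds and locales, \s can match the lone byte 0xA0, so the marker lands in the middle of the character and it shows up as �. This is the likely cause rather than a confirmed one, and the byte-mode result varies between environments, so the minimal sketch below only checks the /u behaviour (the variable names are illustrative):

```php
<?php
// "à" in UTF-8 is two bytes: 0xC3 0xA0.
$a = "\xC3\xA0";
echo bin2hex($a), "\n"; // c3a0

// With /u the subject is read as UTF-8, so "à" is one character
// and \s cannot match in the middle of it.
$text = "Voil\xC3\xA0 vrai";
$withU = preg_replace('/\s/u', ' |||', $text);

// preg_match('//u', ...) returns 1 only for valid UTF-8 input,
// so this confirms the result was not corrupted.
var_dump(preg_match('//u', $withU) === 1); // bool(true)
```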

Also, you do not need so many chained regex replacements: combine the patterns into one capturing group and use a backreference in the replacement.

Use

preg_replace("/(<br>|[\s’'])/u", "$1|||", $text)

Note that \s already matches spaces, carriage returns and newlines, so separate patterns for \r and \n are unnecessary.

Details:

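  • (<br>|[\s’']) - Group 1, capturing either:
    • <br> - the literal character sequence <br>
    • | - or
    • [\s’'] - a whitespace character, ’ or '
  • $1||| - the replacement: the text captured in Group 1 followed by the ||| marker.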

See the PHP demo:

$text = "Voilà. C'est vrai.";
echo preg_replace("/(<br>|[\s’'])/u", "$1|||", $text);
// => Voilà. |||C'|||est |||vrai.
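Since the stated goal is an array, here is a sketch of the remaining split step, assuming the ||| marker never occurs in the input itself. Note that the original /[|||]+/ is a character class, equivalent to /\|+/, so it would also split on a single | in the text; matching exactly three pipes is stricter. The variable names ($marked, $parts, $direct) are illustrative, and the lookbehind variant is an alternative suggestion, not part of the answer above:

```php
<?php
$text = "Voilà. C'est vrai.";

// Step 1: insert the marker after <br>, any whitespace, ’ or '.
$marked = preg_replace("/(<br>|[\s’'])/u", "$1|||", $text);

// Step 2: split on exactly three pipes; PREG_SPLIT_NO_EMPTY drops the
// empty trailing element produced when the text ends with a delimiter.
$parts = preg_split('/\|{3}/', $marked, -1, PREG_SPLIT_NO_EMPTY);
print_r($parts);

// Alternative: split right after each delimiter using a lookbehind,
// which skips the intermediate marker entirely.
$direct = preg_split("/(?<=<br>|[\s’'])/u", $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($direct);
```

Both variants keep the delimiter attached to the end of the preceding piece, which matches what the marker approach produces.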

Upvotes: 1
