Question Overflow

Reputation: 11275

preg_replace not working correctly with Unicode?

I am using the preg_replace function to filter user input. The call below is supposed to strip Unicode control characters, but some of these characters seem to be classified under other categories (punctuation, spaces, etc.) instead, which lets them get past the filter. Why is that?

preg_replace("/[^\p{L}\p{M}\p{N}\p{P}\p{S}]/u", "", $message);

Here are some Unicode control characters that got past the filter above:

U+0085  NEXT LINE (NEL)     …
U+008C  PARTIAL LINE BACKWARD   Œ
U+0095  MESSAGE WAITING     •

DEMO

How safe is preg_replace? And is there a better way to do this?

Upvotes: 1

Views: 1103

Answers (2)

bobince

Reputation: 536795

In your code you have:

"a…Œ•a"

Which contains:

  • … U+2026 Horizontal ellipsis
  • Œ U+0152 Latin capital ligature OE
  • • U+2022 Bullet

As you might expect, Œ is a Letter \p{L} and the other two are Punctuation \p{P} so all are permitted.
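One way to confirm which category each character falls into is to test it against the property classes one at a time. This is only a minimal sketch, assuming PHP 7+ (for the "\u{...}" escape) and a PCRE build with Unicode property support; the helper name category_of is made up for this illustration.

```php
<?php
// Hypothetical helper (not part of PHP): return the first top-level
// Unicode general category class that matches a single UTF-8 character.
function category_of(string $char): string {
    foreach (['L', 'M', 'N', 'P', 'S', 'Z', 'C'] as $class) {
        // \z anchors at the very end of the subject.
        if (preg_match('/^\p{' . $class . '}\z/u', $char) === 1) {
            return $class;
        }
    }
    return '?';
}

echo category_of("\u{2026}"), "\n"; // horizontal ellipsis: P
echo category_of("\u{0152}"), "\n"; // Latin capital ligature OE: L
echo category_of("\u{2022}"), "\n"; // bullet: P
echo category_of("\u{0085}"), "\n"; // the real NEL control character: C
```

The first three are the characters actually present in the pasted string, and none of them is in category C, so the negated class leaves them alone.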

You have been misled by a resource somewhere in which someone has said that … is U+0085, and so on; this is not the case. The likely reason is that they wrote an HTML file with the numeric character reference &#133; in it.

In HTML the character references &#128; to &#159; (aka &#x80; to &#x9F;) do not actually mean the Unicode characters with the code points U+0080 to U+009F. Instead they mean the characters whose encoded form in the Windows code page 1252 (Western European) encoding lies between 0x80 and 0x9F. Byte 0x85 in code page 1252 is the ellipsis, so &#133; means U+2026 and not U+0085.
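The byte-to-character mapping described above can be checked with iconv. This is a rough sketch that assumes the iconv extension is available and that it recognises the encoding name 'CP1252'; the function name cp1252_to_codepoint is hypothetical.

```php
<?php
// Hypothetical helper: decode one Windows-1252 byte and return the
// Unicode code point it stands for.
function cp1252_to_codepoint(int $byte): int {
    // UCS-4BE is a fixed 4-byte big-endian form, so unpack('N') reads it directly.
    $ucs4 = iconv('CP1252', 'UCS-4BE', chr($byte));
    return unpack('N', $ucs4)[1];
}

foreach ([0x85, 0x8C, 0x95] as $byte) {
    printf("0x%02X -> U+%04X\n", $byte, cp1252_to_codepoint($byte));
}
// 0x85 -> U+2026
// 0x8C -> U+0152
// 0x95 -> U+2022
```

These are the same three code points found in the pasted string, which is consistent with the characters having come from HTML numeric references rather than from the C1 control range.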

This is due to historical reasons: bugs in ancient browsers that predated a modern understanding of Unicode, copied by others and finally standardised by HTML5. XML does not suffer from this anomaly: in XHTML, &#133; really is U+0085.

Your expression works OK for the real (invisible, "C1") control characters in code points U+0080-U+009F:

function unichr($i) { // get character from code point, in UTF-8 string form
    return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}

$message = 'a'.unichr(0x85).unichr(0x8C).unichr(0x95).'a';
$filtered = preg_replace("/[^\p{L}\p{M}\p{N}\p{P}\p{S}]/u", "", $message);
var_dump($filtered);

<<< string(2) "aa"
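If the goal is only to drop control characters, one alternative is to match that category directly instead of whitelisting everything else. What follows is a sketch under assumptions, not a recommendation for every input: it assumes PHP 7.1+ and valid UTF-8 input, and the name strip_controls is made up for the example. Note that \p{Cc} also covers tab and newline, so those are removed too.

```php
<?php
// Hypothetical helper: remove characters in the Unicode "Cc" (control)
// category and keep everything else, including spaces.
// preg_replace() returns null when the subject is not valid UTF-8 under /u.
function strip_controls(string $text): ?string {
    return preg_replace('/\p{Cc}/u', '', $text);
}

var_dump(strip_controls("a\u{0085}b c")); // string(4) "ab c"
var_dump(strip_controls("a\tb"));         // string(2) "ab"
var_dump(strip_controls("\xFF"));         // NULL (invalid UTF-8)
```

Checking for the null return matters for user input, since a malformed byte sequence would otherwise silently turn the whole message into an empty value.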

Upvotes: 3

Rohan Kumar

Reputation: 40639

Try calling utf8_encode() on the input before passing it to preg_replace(), like this:

preg_replace("/[^\p{L}\p{M}\p{N}\p{P}\p{S}]/u", "", utf8_encode($message));

Upvotes: 0
