Reputation: 11275
I am using the preg_replace function to filter user input. The call below is supposed to strip Unicode control characters, but some of these characters seem to be classified under other categories (punctuation, spaces, etc.) instead, which lets them get past the filter. Why is that?
preg_replace("/[^\p{L}\p{M}\p{N}\p{P}\p{S}]/u", "", $message);
Here are some Unicode control characters that got past the filter using the method above:
U+0085 NEXT LINE (NEL) …
U+008C PARTIAL LINE BACKWARD Œ
U+0095 MESSAGE WAITING •
How safe is preg_replace? And is there a better way to do this?
Upvotes: 1
Views: 1103
Reputation: 536795
In your code you have:
"a…Œ•a"
Which contains:
… U+2026 Horizontal ellipsis
Œ U+0152 Latin capital ligature OE
• U+2022 Bullet
As you might expect, Œ is a Letter (\p{L}) and the other two are Punctuation (\p{P}), so all are permitted.
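One way to check what a string really contains is to list the code point and general category of each character. The following is a minimal sketch, not part of the original answer: the helper name `describe_chars` is hypothetical, and it assumes PHP 7+ with the iconv extension and a PCRE build with Unicode property support.

```php
<?php
// Hypothetical helper: report the code point and the top-level Unicode
// category (L, M, N, P, S, Z or C) of every character in a UTF-8 string.
function describe_chars(string $s): array {
    $out = [];
    // Split into characters (not bytes) thanks to the /u modifier.
    foreach (preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
        // Convert the character to big-endian UCS-4 and read it as an integer.
        $cp = unpack('N', iconv('UTF-8', 'UCS-4BE', $ch))[1];
        $cat = '?';
        foreach (['L', 'M', 'N', 'P', 'S', 'Z', 'C'] as $c) {
            if (preg_match('/\A\p{' . $c . '}\z/u', $ch)) {
                $cat = $c;
                break;
            }
        }
        $out[] = sprintf('U+%04X %s', $cp, $cat);
    }
    return $out;
}

// The pasted characters, written as escapes so the file encoding does not matter.
print_r(describe_chars("a\u{2026}\u{0152}\u{2022}a"));

// The real C1 control character NEL (U+0085), for comparison.
print_r(describe_chars("\u{0085}"));
```

The first call should report the ellipsis and bullet under P and the ligature under L, while the real U+0085 falls under C, which is why only the latter is removed by the question's expression.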
You have been misled by a resource somewhere where someone has said that … is U+0085, and so on; this is not the case. The likely reason this has happened is that they wrote an HTML file with the numeric character reference &#133; in it.
In HTML, the character references &#128; to &#159; (aka &#x80; to &#x9F;) do not actually mean the Unicode characters with the code points U+0080 to U+009F. Instead they mean the characters whose encoded form in the Windows code page 1252 (Western European) encoding lies between 0x80 and 0x9F. Byte 0x85 in code page 1252 is the ellipsis, so &#133; means U+2026 and not U+0085.
This is for historical reasons: bugs in ancient browsers that predated a modern understanding of Unicode were copied by other browsers and finally standardised by HTML5. XML does not suffer from this anomaly: in XHTML, &#133; really is U+0085.
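To see the code page 1252 mapping concretely, a small sketch can decode individual Windows-1252 bytes and print the code point each one becomes. The helper name `cp1252_to_codepoint` is made up for illustration, and the sketch assumes the iconv extension recognises the `CP1252` encoding name (as glibc's iconv does).

```php
<?php
// Hypothetical helper: decode one Windows-1252 byte and return the Unicode
// code point it maps to.
function cp1252_to_codepoint(string $byte): int {
    // Convert to big-endian UCS-4 so the code point can be read as a 32-bit integer.
    $ucs4 = iconv('CP1252', 'UCS-4BE', $byte);
    return unpack('N', $ucs4)[1];
}

// The three bytes behind the characters in the question.
foreach ([0x85, 0x8C, 0x95] as $b) {
    printf("0x%02X -> U+%04X\n", $b, cp1252_to_codepoint(chr($b)));
}
// prints:
// 0x85 -> U+2026
// 0x8C -> U+0152
// 0x95 -> U+2022
```

These are the ellipsis, the OE ligature and the bullet, not the C1 control characters with the same numeric values.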
Your expression works OK for the real (invisible, "C1") control characters in code points U+0080-U+009F:
function unichr($i) { // get character from code point, in UTF-8 string form
return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}
$message = 'a'.unichr(0x85).unichr(0x8C).unichr(0x95).'a';
$filtered = preg_replace("/[^\p{L}\p{M}\p{N}\p{P}\p{S}]/u", "", $message);
var_dump($filtered);
<<< string(2) "aa"
Upvotes: 3
Reputation: 40639
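Try utf8_encode() before using preg_replace(). Note that utf8_encode() assumes $message is ISO-8859-1, so it will double-encode a string that is already UTF-8. For example: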
preg_replace("/[^\p{L}\p{M}\p{N}\p{P}\p{S}]/u", "", utf8_encode($message));
Upvotes: 0