Reputation: 878
I'm trying to find the best way to check whether a string contains any "weird" UTF-8 characters.
Basically, I'm looking for something that guards against the various Unicode control characters and non-standard whitespace characters that can be hidden in a string. By "hidden" I mean that printing the string to the screen would not display those characters: they would show up either as spaces or as nothing at all.
My previous approach was to return true if the string is one of those characters:
function isUnusualString($string) {
    if($string == "")
        return TRUE;
    $char = ord($string);
    if($char < 33)
        return TRUE;
    if($char > 8191 && $char < 8208)
        return TRUE;
    if($char > 8231 && $char < 8240)
        return TRUE;
    switch($char) {
        case 160: // Non-Breaking Space
        case 8287: // Medium Mathematical Space
            return TRUE;
            break;
    }
    return FALSE;
}
However, this does not catch all cases, and I don't know why. I assume some of these characters can have a length greater than 1, or a length of 0?
So next I tried iterating over the characters of the string and checking if the string contains any of the "hidden" characters.
For example running the following code:
function isUnusualUTF($string) {
    if($string == "")
        return TRUE;
    $strlen = strlen($string);
    for ($i = 0; $i < $strlen; ++$i) {
        $char = ord($string[$i]);
        if($char < 33)
            echo "char = ".$char." at index: ".$i." is < 33";
        if($char > 8191 && $char < 8208)
            echo "char = ".$char." at index: ".$i." is > 8191 and < 8208 ";
        if($char > 8231 && $char < 8240)
            echo "char = ".$char." at index: ".$i." is > 8231 and < 8240 ";
        switch($char) {
            case 160: // Non-Breaking Space
            case 8287: // Medium Mathematical Space
                echo "cases<br>"; //return TRUE;
                break;
        }
    }
    return FALSE;
}
$string = "Unicode ";
echo isUnusualUTF($string);
Outputs:
char = 32 at index: 7 is < 33
I think that the best way to do this would be with a regex that does:
if string has (numbers or letters or " " or other symbols
that can be printed and seen in the screen)
return true
else
return false
Thank you
Upvotes: 0
Views: 541
Reputation: 1867
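In PHP you can use a regex to find characters with certain Unicode properties using these escapes (they require UTF-8 mode, i.e. the u pattern modifier):
\p{xx} (matches a character that has property xx)
\P{xx} (matches a character that does not have property xx)
where xx is the property you are looking for.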
Here is a list of properties: http://php.net/manual/en/regexp.reference.unicode.php
I think for your case you would want to fashion a pattern like this:
[\p{xx}\p{yy}...etc]+
where "...etc" is symbolic and represents additional properties. With xx and yy set to the properties of the hidden characters, this should match all of the characters you're looking for.
Here's a link to test your regex statement: http://www.phpliveregex.com/
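As a rough illustration of the property-based approach described above, here is a minimal sketch. The function name hasHiddenCharacters is hypothetical, and the choice of properties is an assumption rather than something the answer specifies: \p{C} covers control, format (such as the zero-width space), and unassigned characters, and \p{Z} covers the separators, including the non-breaking space. The ordinary space is deliberately allowed, following the pseudocode in the question.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Returns true if $string contains a character that usually renders
// as nothing or as a blank, other than the ordinary space (U+0020).
function hasHiddenCharacters(string $string): bool
{
    // (?! )   : the next character must not be a plain space
    // \p{C}   : "Other" characters (control, format, unassigned, ...)
    // \p{Z}   : separators (space, line and paragraph separators)
    // u flag  : treat pattern and subject as UTF-8 (required for \p)
    $pattern = '/(?! )[\p{C}\p{Z}]/u';

    $result = preg_match($pattern, $string);

    // preg_match returns 1 on a match, 0 on no match, and false on error
    // (for example when the subject is not valid UTF-8). An error is
    // treated as "unusual" here, which is an assumption of this sketch.
    return $result !== 0;
}
```

With this sketch, hasHiddenCharacters("Unicode text") would be false, while a string containing a zero-width space ("\u{200B}") or a non-breaking space ("\u{00A0}") would be reported as true. Tabs and newlines are control characters, so they are flagged as well; if they should be allowed, the lookahead can be extended accordingly.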
Upvotes: 1
Reputation: 27227
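Use the multibyte versions of those functions, since strlen and ord work on single bytes rather than on whole UTF-8 characters:
mb_strlen: https://www.php.net/mb_strlen
Although I believe this function may do exactly what you want (note that it only checks whether the string is valid in the given encoding): https://www.php.net/manual/en/function.mb-check-encoding.php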
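To show what switching to the multibyte functions could look like, here is a minimal sketch that applies the range checks from the question to code points instead of bytes. The function name findUnusualCodePoints is hypothetical. It assumes PHP 7.4 or later (for mb_str_split; mb_ord needs 7.2), and it allows the ordinary space, which is an assumption based on the pseudocode in the question.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Returns an array mapping character index => code point for every
// character that falls into the "unusual" ranges used in the question.
function findUnusualCodePoints(string $string): array
{
    // mb_check_encoding only validates the byte sequence; it does not
    // detect invisible characters by itself.
    if (!mb_check_encoding($string, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $found = [];
    // Split into whole characters rather than bytes.
    foreach (mb_str_split($string, 1, 'UTF-8') as $index => $char) {
        // mb_ord returns the full code point, unlike ord (first byte only).
        $codePoint = mb_ord($char, 'UTF-8');

        $unusual = $codePoint < 32                              // control characters
            || ($codePoint > 8191 && $codePoint < 8208)         // U+2000..U+200F
            || ($codePoint > 8231 && $codePoint < 8240)         // U+2028..U+202F
            || $codePoint === 160                               // non-breaking space
            || $codePoint === 8287;                             // medium mathematical space

        if ($unusual) {
            $found[$index] = $codePoint;
        }
    }
    return $found;
}
```

An empty result means nothing in those ranges was found. The ranges are the ones from the question and are not a complete list of invisible characters.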
Upvotes: 0