Reputation: 14324
I'm trying to detect if a Unicode string is printable.
For example, I have a user who has set their name to %EF%B8%8F, which is variation selector-16 (U+FE0F).
I want to be able to do something like
if ($screen_name == null || $screen_name == NotPrintable )
{
...Show an error...
} else {
...Proceed as normal...
}
Is there any way to detect if a Unicode string is printable?
User names can be any valid Unicode sequence (English, Chinese, Arabic, etc).
Some previous answers suggest complex regexes which look like they only work with a narrow range of characters.
I've tried counting the length of the string, but that doesn't work:
$odd = urldecode("%EF%B8%8F");
print strlen($odd);
3
Same result for mb_strlen() as well.
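For reference, a small sketch of what the two length functions measure for this input. It assumes the mbstring extension is loaded and passes UTF-8 explicitly; a result of 3 from mb_strlen() would suggest a non-UTF-8 internal encoding, which is an assumption about the setup, not something stated above. Either way the length is nonzero, so length alone cannot flag this name.

```php
<?php
$odd = urldecode('%EF%B8%8F');

// strlen() counts bytes: U+FE0F is encoded as three bytes in UTF-8.
echo strlen($odd), PHP_EOL;             // 3

// mb_strlen() counts code points when told the string is UTF-8.
echo mb_strlen($odd, 'UTF-8'), PHP_EOL; // 1

// The raw bytes, for comparison with the URL-encoded form.
echo bin2hex($odd), PHP_EOL;            // efb88f
```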
Functions like ctype_print() won't work because regular strings can contain non-printable characters.
So, is there any way to detect whether a Unicode string will display printable characters?
Upvotes: 0
Views: 545
Reputation: 1617
Working from the PHP regexp guide for Unicode, I assume you want to keep all letters (L), marks (M), numbers (N), punctuation (P), symbols (S) and separators such as spaces (Z), and drop everything else (such as control characters). Therefore, a regexp of:
$out=preg_replace('/[^\pL\pM\pN\pP\pS\pZ]/u','',$in);
appears to do the trick.
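A quick sketch of what this kind of pattern removes and what it keeps. The sample strings are made up for illustration. Control characters (Cc) and format characters (Cf) such as the zero width space fall outside the listed categories and are stripped, while anything in the mark category, including U+FE0F, is kept.

```php
<?php
$pattern = '/[^\pL\pM\pN\pP\pS\pZ]/u';

// BEL (U+0007, category Cc) and ZERO WIDTH SPACE (U+200B, category Cf) are removed.
$in  = "a\u{0007}b\u{200B}c";
$out = preg_replace($pattern, '', $in);
var_dump($out); // string(3) "abc"

// U+FE0F is a non-spacing mark (Mn), so \pM keeps it and the string is unchanged.
$vs16 = urldecode('%EF%B8%8F');
var_dump(preg_replace($pattern, '', $vs16) === $vs16); // bool(true)
```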
[edit]
Well, that doesn't work with the provided
$in=urldecode('%EF%B8%8F');
example, which decodes to Unicode code point U+FE0F: that code point is a non-spacing mark (category Mn), so \pM keeps it. The following code does handle it:
$len=mb_strlen($in);
$out='';
// General categories to drop; here only non-spacing marks (Mn).
$disallowedTypes=[IntlChar::CHAR_CATEGORY_NON_SPACING_MARK];
for ($i=0;$i<$len;$i++) {
    // Take one code point at a time and look up its general category.
    $char=mb_substr($in,$i,1);
    $type=IntlChar::charType($char);
    if (false===in_array($type,$disallowedTypes,true)) {
        $out.=$char;
        //print 'Adding ord '.dechex(IntlChar::ord($char)).' which is '.IntlChar::charType($char).PHP_EOL;
    }
}
but I'm not happy iterating through a string and comparing each character... Please let me know if you find a better way.
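One possible way to avoid the explicit loop, as a sketch rather than a definitive answer: PCRE can match the same general category directly with \p{Mn}. The helper names below are hypothetical and do not come from the posts above. The second helper assumes "printable" means "contains at least one letter, number, punctuation mark or symbol", which is an interpretation of the question, not a standard definition. PCRE and ICU may also be built against different Unicode versions, so edge cases could differ from IntlChar.

```php
<?php
// Hypothetical helper: remove every non-spacing mark (Mn), like the loop does.
// Note this also strips combining accents from decomposed text.
function strip_nonspacing_marks(string $in): string
{
    // preg_replace() returns null on error (e.g. invalid UTF-8); fall back to ''.
    return preg_replace('/\p{Mn}/u', '', $in) ?? '';
}

// Hypothetical helper: true if the string has at least one letter, number,
// punctuation mark or symbol. Marks, separators and control/format
// characters on their own do not count.
function has_visible_character(string $in): bool
{
    return preg_match('/[\pL\pN\pP\pS]/u', $in) === 1;
}

$vs16 = urldecode('%EF%B8%8F');
var_dump(strip_nonspacing_marks($vs16) === ''); // bool(true)
var_dump(has_visible_character($vs16));         // bool(false)
var_dump(has_visible_character('3 muços?'));    // bool(true)
```

With the second helper, the check from the question could become a test such as `$screen_name === null || !has_visible_character($screen_name)`.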
Upvotes: 2
Reputation: 1558
What about this regex?
<?php
define("CTYPE_PRINT_UNICODE_PATTERN", "~^[\pL\pN\s\"\~". preg_quote("!#$%&'()*+,-./:;<=>?@[\]^_`{|}´") ."]+$~u");
function ctype_print_unicode($input) {
    return preg_match(CTYPE_PRINT_UNICODE_PATTERN, $input);
}
print ctype_print_unicode("3 muços?"); // 1
Upvotes: 0