Terence Eden

Reputation: 14324

String comparison for "blank" Unicode characters in PHP

I'm trying to detect if a Unicode string is printable.

For example, I have a user who has set their name to %EF%B8%8F - which is variation selector-16 (U+FE0F)

I want to be able to do something like

if ($screen_name == null || $screen_name == NotPrintable )
{
    ...Show an error...
} else  {
    ...Proceed as normal...
}

Is there any way to detect if a Unicode string is printable?

User names can be any valid Unicode sequence (English, Chinese, Arabic, etc).

Some previous answers suggest complex regexes which look like they only work with a narrow range of characters.

I've tried counting the length of the string, but that doesn't work -

$odd = urldecode("%EF%B8%8F");
print strlen($odd); // 3

mb_strlen() is no help either: it also returns a non-zero length (1 when the string is treated as UTF-8).

Functions like ctype_print() won't work because regular strings can contain non-printable characters.

So, is there any way to detect whether a Unicode string will display printable characters?

Upvotes: 0

Views: 545

Answers (2)

Richy B.

Reputation: 1617

Working from the PHP regexp guide for Unicode, I assume you want to keep all letters (L), marks (M), numbers (N), punctuation (P), symbols (S) and separators (Z), and drop everything else (such as control characters). Therefore, a regexp of:

// No "|" between the classes: inside [...] a pipe is a literal character, not alternation.
$out = preg_replace('/[^\pL\pM\pN\pP\pS\pZ]/u', '', $in);

appears to do the trick.
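
As a quick illustration of what that character class strips, here is a minimal sketch. The sample string is my own assumption, and the pattern is written without the | separators since they are literal inside a character class. Control characters (category Cc, which includes tab and BEL) and format characters (Cf, such as the zero-width space U+200B) are removed, while the ordinary space (Zs) survives:

```php
<?php
// Same classes as the pattern above: keep L, M, N, P, S, Z and drop the rest.
$pattern = '/[^\pL\pM\pN\pP\pS\pZ]/u';

// Assumed sample input: zero-width space (Cf), BEL (Cc) and a trailing tab (Cc).
$in  = "Jo\u{200B}hn\x07 Smith\t";
$out = preg_replace($pattern, '', $in);

var_dump($out);                 // the cleaned string
var_dump(strlen($in), strlen($out)); // byte lengths before and after
```

The "\u{...}" escape needs PHP 7.0 or later; on older versions the bytes would have to be written out by hand.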

[edit]

Well, that doesn't work with the provided

$in=urldecode('%EF%B8%8F');

example, which decodes to the Unicode code point U+FE0F: that is a non-spacing mark (Mn), so \pM keeps it. The following code does handle it:

// Requires the intl extension for IntlChar.
$len = mb_strlen($in, 'UTF-8');
$out = '';
// General categories to strip; U+FE0F is a non-spacing mark (Mn).
$disallowedTypes = [IntlChar::CHAR_CATEGORY_NON_SPACING_MARK];
for ($i = 0; $i < $len; $i++) {
    $char = mb_substr($in, $i, 1, 'UTF-8');
    $type = IntlChar::charType($char);
    if (!in_array($type, $disallowedTypes, true)) {
        $out .= $char;
        //print 'Adding ord '.dechex(IntlChar::ord($char)).' which is '.IntlChar::charType($char).PHP_EOL;
    }
}

but I'm not happy iterating through a string and comparing each character... Please let me know if you find a better way.
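
One possible way to avoid the loop, offered as a sketch rather than a definitive answer: instead of stripping characters, ask whether the string contains at least one character from a category that normally draws something (letter, number, punctuation or symbol). The function name has_visible_character() is made up for this example, and treating marks and separators as "not visible on their own" is my assumption about what the question wants.

```php
<?php
// Hypothetical helper; the name and the choice of categories are assumptions.
// True when $name holds at least one letter, number, punctuation mark or symbol.
// Marks (M), separators (Z), control and format characters alone do not count.
function has_visible_character(?string $name): bool
{
    if ($name === null || $name === '') {
        return false;
    }
    // With the /u modifier preg_match() returns false on invalid UTF-8,
    // which is treated as "not printable" here.
    return preg_match('/[\pL\pN\pP\pS]/u', $name) === 1;
}

var_dump(has_visible_character(urldecode('%EF%B8%8F'))); // variation selector only
var_dump(has_visible_character('Terence'));
var_dump(has_visible_character('李小龍'));
var_dump(has_visible_character("\u{200B} "));            // zero-width space + space
```

This is a heuristic: Unicode categories describe what a character is, not whether a given font draws it, so a few code points classed as letters (U+3164 HANGUL FILLER is the usual example) can still render blank and would need an explicit blocklist.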

Upvotes: 2

Gennadiy Litvinyuk

Reputation: 1558

What about this Regex?

<?php
define("CTYPE_PRINT_UNICODE_PATTERN", "~^[\pL\pN\s\"\~". preg_quote("!#$%&'()*+,-./:;<=>?@[\]^_`{|}´") ."]+$~u");

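function ctype_print_unicode($input) {
    // preg_match() returns 1 on a match, 0 on no match and false on error
    // (for example, invalid UTF-8 with the /u modifier).
    return preg_match(CTYPE_PRINT_UNICODE_PATTERN, $input);
}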

print ctype_print_unicode("3 muços?"); // 1
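
To see how this whitelist behaves on the string from the question, here is a small check. The pattern and function are copied from above so the snippet runs on its own; the extra test inputs are my own assumptions. Two things are worth noticing: combining marks are not in the whitelist, so decomposed accents are rejected, and \s is in the whitelist, so a whitespace-only name is accepted.

```php
<?php
// Copied from the answer above so that this snippet is self-contained.
define("CTYPE_PRINT_UNICODE_PATTERN", "~^[\pL\pN\s\"\~". preg_quote("!#$%&'()*+,-./:;<=>?@[\]^_`{|}´") ."]+$~u");

function ctype_print_unicode($input) {
    return preg_match(CTYPE_PRINT_UNICODE_PATTERN, $input);
}

var_dump(ctype_print_unicode(urldecode('%EF%B8%8F'))); // U+FE0F alone: rejected
var_dump(ctype_print_unicode("3 muços?"));             // precomposed ç: accepted
var_dump(ctype_print_unicode("e\u{0301}"));            // e + combining acute: rejected
var_dump(ctype_print_unicode("   "));                  // spaces only: accepted
```

If names in scripts that rely on combining marks matter, adding \pM to the class and separately requiring at least one \pL or \pN may be a better fit, but that is a design choice beyond what this answer claims.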

Upvotes: 0
