Eissaweb

Reputation: 192

How to find unicode character class in PHP

I am having a hard time finding a way to get the Unicode class (general category) of a character.

List of Unicode classes: https://www.php.net/manual/en/regexp.reference.unicode.php

The desired function in Python: https://docs.python.org/3/library/unicodedata.html#unicodedata.category

I just want the PHP equivalent of this Python function.

For example, if I called a function x like this: x('-'), it would return Pd, because Pd is the class the hyphen belongs to.

Thanks.

Upvotes: 2

Views: 324

Answers (3)

sln

Reputation: 2759

I'm posting this because it might be useful. I have done this before on a very large scale.

Below is a condensed way to do it using PHP.

Notes:

A single regex is generated once at startup.
It contains an optional lookahead assertion with a capture group for each property.
Example: (?=(\p{Property1}))?(?=(\p{Property2}))? ... (?=(\p{PropertyN}))?
Each character in the target is checked against all the properties in the array.
Each capture group number is an index into the property array $General_Cat_Props;
that is how a matched group is associated with its property name when a match is analyzed
for printing.

This solves the problem that a single character can be matched by several properties.
To check more properties, just add them to $General_Cat_Props.
No other change is necessary.

There are 2 functions:

  1. Get_UniCategories_From_Char( $char ) analyzes a single character.
  2. Get_UniCategories_From_String( $str ) analyzes a string ( calls 1 on each character ).

The array $General_Cat_Props below can be added to or trimmed as needed to build a custom filter.
Several separate constant property arrays can be kept for special checks. The order of the properties in the array is irrelevant.

Regex101 quick global test bed

/(?=.)(?=(\p{Cn}))?(?=(\p{Cc}))?(?=(\p{Cf}))?(?=(\p{Co}))?(?=(\p{Cs}))?(?=(\p{Lu}))?(?=(\p{Ll}))?(?=(\p{Lt}))?(?=(\p{Lm}))?(?=(\p{Lo}))?(?=(\p{Mn}))?(?=(\p{Me}))?(?=(\p{Mc}))?(?=(\p{Pd}))?(?=(\p{Ps}))?(?=(\p{Pe}))?(?=(\p{Pc}))?(?=(\p{Po}))?(?=(\p{Pi}))?(?=(\p{Pf}))?(?=(\p{Sm}))?(?=(\p{Sc}))?(?=(\p{Sk}))?(?=(\p{So}))?(?=(\p{Zs}))?(?=(\p{Zl}))?(?=(\p{Zp}))?/su

https://regex101.com/r/fvVZX0/1

PHP
Mod: After realizing that PHP only populates the $matches array up to the last optional group that matched, a check was added when creating the result (see $last_grp_matched = sizeof($matches);).

Previously the full array was forced by adding a capture group (.) at the end. The old code still works; see the previous version if needed.

http://sandbox.onlinephpfunctions.com/code/f1aeca3d9a99d1b2d1bfc72c3dd004ad232bc29e

<?php

// The prop array
$General_Cat_Props = [
"",
"Cn", "Cc", "Cf", "Co", "Cs",
"Lu", "Ll", "Lt", "Lm", "Lo",
"Mn", "Me", "Mc", // "Nd", "Nl", "No",
"Pd", "Ps", "Pe", "Pc", "Po", "Pi", "Pf",
"Sm", "Sc", "Sk", "So",
"Zs", "Zl", "Zp"
];

// The Rx
$GCRx = "";   // filled in by makeGCRx()

// One-time make function
function makeGCRx()
{
    global $General_Cat_Props, $GCRx ;
    $rxstr = "(?=.)";     // Start of regex, something must be ahead
    for ($i = 1; $i < sizeof( $General_Cat_Props ); $i++) {
        $rxstr .= "(?=(\\p{" . $General_Cat_Props[ $i ] . "}))?";
    }
    $GCRx = "/$rxstr/su";
}

makeGCRx();
// print_r($GCRx . "\n");

function Get_UniCategories_From_Char( $char )
{
    global $General_Cat_Props, $GCRx;
    $ret = "";
    if ( preg_match( $GCRx, $char, $matches )) {
        $last_grp_matched = sizeof($matches);
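        // Both bounds must hold: "," would only test the last expression
        for ($i = 1; $i < sizeof( $General_Cat_Props ) && $i < $last_grp_matched; $i++) {
            if ( $matches[ $i ] !== "" ) {   // unmatched groups come back as ""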
                $ret .= $General_Cat_Props[ $i ] . " ";
            }
        }
    }
    return $ret;
}

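function Get_UniCategories_From_String( $str )
{
    $ret = "";
    // Split on code points, not bytes, so multibyte UTF-8 characters stay intact
    $chars = preg_split( '//u', $str, -1, PREG_SPLIT_NO_EMPTY ) ?: [];
    foreach ($chars as $char) {
        $ret .= $char . "  " . Get_UniCategories_From_Char( $char ) . "\n";
    }
    return $ret;
}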

print_r( "-  " . Get_UniCategories_From_Char( "-" ) . "\n--------\n" );
// or 
print_r( Get_UniCategories_From_String( "Hello 270 -,+?" ) . "\n" );

Output:

-  Pd 
--------
H  Lu 
e  Ll 
l  Ll 
l  Ll 
o  Ll 
   Zs 
2  
7  
0  
   Zs 
-  Pd 
,  Po 
+  Sm 
?  Po 
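
The trimming behaviour described in the Mod note can be seen in isolation. The following is a minimal sketch, not part of the original answer: the three-category pattern is a hypothetical cut-down version of the generated regex, and the PREG_UNMATCHED_AS_NULL variant assumes PHP 7.4 or later.

```php
<?php
// Hypothetical three-category pattern, built the same way as the generated regex.
$rx = '/(?=.)(?=(\p{Lu}))?(?=(\p{Ll}))?(?=(\p{Pd}))?/su';

// "a" is matched by group 2 (Ll) only; group 3 (Pd) never participates.
preg_match($rx, 'a', $default);
// Default flags: unmatched leading groups are "", trailing ones are dropped.
var_dump(count($default));   // int(3): indexes 0, 1 and 2 only

// With PREG_UNMATCHED_AS_NULL (assumes PHP 7.4+), every group is present
// and unmatched ones are null, so the array always has the same size.
preg_match($rx, 'a', $padded, PREG_UNMATCHED_AS_NULL);
var_dump(count($padded));    // int(4)
var_dump($padded[3]);        // NULL
```

If the target PHP version supports the flag, the loop could probably run over the whole property array without the size check, testing each group against null instead.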

Upvotes: 1

Casimir et Hippolyte

Reputation: 89557

A possible way is to use IntlChar::charType. Unfortunately, this method only returns an int, but that int is one of the constants defined in the IntlChar class. The constants for the 30 categories cover the range 0 to 29 with no gaps. So all you have to do is build an indexed array that follows the same order:

$shortCats = [
    'Cn', 'Lu', 'Ll', 'Lt', 'Lm', 'Lo',
    'Mn', 'Me', 'Mc', 'Nd', 'Nl', 'No',
    'Zs', 'Zl', 'Zp', 'Cc', 'Cf', 'Co',
    'Cs', 'Pd', 'Ps', 'Pe', 'Pc', 'Po',
    'Sm', 'Sc', 'Sk', 'So', 'Pi', 'Pf'
];

echo $shortCats[IntlChar::charType('-')]; //Pd

Note: if you are worried that the numeric values defined in the class might change in the future and want to be more rigorous, you can also write the array this way:

$shortCats = [
    IntlChar::CHAR_CATEGORY_UNASSIGNED => 'Cn',
    IntlChar::CHAR_CATEGORY_UPPERCASE_LETTER => 'Lu',
    IntlChar::CHAR_CATEGORY_LOWERCASE_LETTER => 'Ll',
    IntlChar::CHAR_CATEGORY_TITLECASE_LETTER => 'Lt',
    // etc.
];
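
For reference, the "etc." can be spelled out as below. This is a sketch that assumes the intl extension is loaded; the function name unicode_category() is hypothetical and not part of any library, while the constants are the ones defined on IntlChar.

```php
<?php
// Hypothetical helper; requires the intl extension.
function unicode_category(string $char): ?string
{
    // Keyed by the IntlChar constants, so the numeric values never matter.
    static $shortCats = [
        IntlChar::CHAR_CATEGORY_UNASSIGNED             => 'Cn',
        IntlChar::CHAR_CATEGORY_UPPERCASE_LETTER       => 'Lu',
        IntlChar::CHAR_CATEGORY_LOWERCASE_LETTER       => 'Ll',
        IntlChar::CHAR_CATEGORY_TITLECASE_LETTER       => 'Lt',
        IntlChar::CHAR_CATEGORY_MODIFIER_LETTER        => 'Lm',
        IntlChar::CHAR_CATEGORY_OTHER_LETTER           => 'Lo',
        IntlChar::CHAR_CATEGORY_NON_SPACING_MARK       => 'Mn',
        IntlChar::CHAR_CATEGORY_ENCLOSING_MARK         => 'Me',
        IntlChar::CHAR_CATEGORY_COMBINING_SPACING_MARK => 'Mc',
        IntlChar::CHAR_CATEGORY_DECIMAL_DIGIT_NUMBER   => 'Nd',
        IntlChar::CHAR_CATEGORY_LETTER_NUMBER          => 'Nl',
        IntlChar::CHAR_CATEGORY_OTHER_NUMBER           => 'No',
        IntlChar::CHAR_CATEGORY_SPACE_SEPARATOR        => 'Zs',
        IntlChar::CHAR_CATEGORY_LINE_SEPARATOR         => 'Zl',
        IntlChar::CHAR_CATEGORY_PARAGRAPH_SEPARATOR    => 'Zp',
        IntlChar::CHAR_CATEGORY_CONTROL_CHAR           => 'Cc',
        IntlChar::CHAR_CATEGORY_FORMAT_CHAR            => 'Cf',
        IntlChar::CHAR_CATEGORY_PRIVATE_USE_CHAR       => 'Co',
        IntlChar::CHAR_CATEGORY_SURROGATE              => 'Cs',
        IntlChar::CHAR_CATEGORY_DASH_PUNCTUATION       => 'Pd',
        IntlChar::CHAR_CATEGORY_START_PUNCTUATION      => 'Ps',
        IntlChar::CHAR_CATEGORY_END_PUNCTUATION        => 'Pe',
        IntlChar::CHAR_CATEGORY_CONNECTOR_PUNCTUATION  => 'Pc',
        IntlChar::CHAR_CATEGORY_OTHER_PUNCTUATION      => 'Po',
        IntlChar::CHAR_CATEGORY_MATH_SYMBOL            => 'Sm',
        IntlChar::CHAR_CATEGORY_CURRENCY_SYMBOL        => 'Sc',
        IntlChar::CHAR_CATEGORY_MODIFIER_SYMBOL        => 'Sk',
        IntlChar::CHAR_CATEGORY_OTHER_SYMBOL           => 'So',
        IntlChar::CHAR_CATEGORY_INITIAL_PUNCTUATION    => 'Pi',
        IntlChar::CHAR_CATEGORY_FINAL_PUNCTUATION      => 'Pf',
    ];

    // charType() accepts a single UTF-8 character (or an int code point)
    // and returns null when the input is not exactly one code point.
    $type = IntlChar::charType($char);
    return $type === null ? null : ($shortCats[$type] ?? null);
}

echo unicode_category('-'), "\n"; // Pd
echo unicode_category('7'), "\n"; // Nd
echo unicode_category('é'), "\n"; // Ll
```

Returning null for anything that is not a single code point is a design choice of this sketch; throwing an exception would work just as well.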

Upvotes: 3

Eissaweb

Reputation: 192

Apparently there is no built-in function that does this, so I wrote this function:

<?php
// All 30 general categories, each listed once (order does not matter)
$UNICODE_CATEGORIES = [
        "Cc", "Cf", "Cs", "Co", "Cn",
        "Lu", "Ll", "Lt", "Lm", "Lo",
        "Mn", "Mc", "Me",
        "Nd", "Nl", "No",
        "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
        "Sm", "Sc", "Sk", "So",
        "Zs", "Zl", "Zp"
    ];

function uni_category($char, $UNICODE_CATEGORIES) {
    foreach ($UNICODE_CATEGORIES as $category) {
        // The u flag makes PCRE treat $char as UTF-8 rather than raw bytes
        if (preg_match('/\p{'.$category.'}/u', $char))
            return $category;
    } 
    return null;
}
// call the function 
print uni_category('-', $UNICODE_CATEGORIES); // it returns Pd

This code works for me. I hope it helps somebody in the future :).

Upvotes: 2
