Reputation: 192
I am having a hard time finding a way to get the Unicode general category of a character.
List of Unicode categories: https://www.php.net/manual/en/regexp.reference.unicode.php
The desired function in Python: https://docs.python.org/3/library/unicodedata.html#unicodedata.category
I just want the PHP equivalent of this Python function.
For example, if I called a function x like this: x('-'), it would return Pd,
because Pd is the category the hyphen belongs to.
Thanks.
Upvotes: 2
Views: 324
Reputation: 2759
I'm posting this because it might be useful; I have done this before on a very large scale.
Below is a condensed way to do it using PHP.
Notes:
A single regex is generated once at startup.
It contains an optional lookahead assertion with a capture group for each property.
Example: (?=(\p{Property1}))?(?=(\p{Property2}))? ... (?=(\p{PropertyN}))?
Each character in the target is checked against all the properties in the array.
Each capture group number is an index into the property array $General_Cat_Props (index 0 is an empty placeholder so that group numbers and array indexes line up). That association is used to look up the property name when a match is analyzed for printing.
This handles the case where a character is matched by more than one property.
To check other properties, just add them to $General_Cat_Props. No other change is necessary.
There are 2 functions: Get_UniCategories_From_Char and Get_UniCategories_From_String.
The array $General_Cat_Props below can be extended or trimmed as needed to build a custom filter.
There can be as many specific constant property arrays as needed for special checks. The order of the properties in the array is irrelevant.
Regex101 quick global test bed
/(?=.)(?=(\p{Cn}))?(?=(\p{Cc}))?(?=(\p{Cf}))?(?=(\p{Co}))?(?=(\p{Cs}))?(?=(\p{Lu}))?(?=(\p{Ll}))?(?=(\p{Lt}))?(?=(\p{Lm}))?(?=(\p{Lo}))?(?=(\p{Mn}))?(?=(\p{Me}))?(?=(\p{Mc}))?(?=(\p{Pd}))?(?=(\p{Ps}))?(?=(\p{Pe}))?(?=(\p{Pc}))?(?=(\p{Po}))?(?=(\p{Pi}))?(?=(\p{Pf}))?(?=(\p{Sm}))?(?=(\p{Sc}))?(?=(\p{Sk}))?(?=(\p{So}))?(?=(\p{Zs}))?(?=(\p{Zl}))?(?=(\p{Zp}))?/su
https://regex101.com/r/fvVZX0/1
PHP
Mod: After realizing that PHP only populates the $matches array up to the last optional group that matched, a check was added when creating the result (see $last_grp_matched = sizeof($matches);).
Previously this was forced by adding a capture group (.) at the end. The old code still works; see the previous version if needed.
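The truncation described above can be seen in isolation with a much smaller pattern. This is only a minimal sketch using default preg_match flags; the three-letter pattern and the variable $m are made up for illustration and are not part of the answer's code.

```php
<?php
// Three optional lookaheads; only the middle one can match the input 'b'.
preg_match('/(?=(a))?(?=(b))?(?=(c))?/', 'b', $m);

// Group 1 did not match but comes before a matched group, so it is kept as ''.
// Group 3 did not match and is trailing, so it is missing from $m entirely.
var_dump($m[1] === '');  // bool(true)
var_dump($m[2]);         // string(1) "b"
var_dump(isset($m[3]));  // bool(false)
var_dump(count($m));     // int(3): whole match plus groups 1 and 2
```

This is why the loop in the code below is bounded by the size of $matches rather than by the size of the property array.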
http://sandbox.onlinephpfunctions.com/code/f1aeca3d9a99d1b2d1bfc72c3dd004ad232bc29e
<?php
// The prop array
$General_Cat_Props = [
"",
"Cn", "Cc", "Cf", "Co", "Cs",
"Lu", "Ll", "Lt", "Lm", "Lo",
"Mn", "Me", "Mc", // "Nd", "Nl", "No",
"Pd", "Ps", "Pe", "Pc", "Po", "Pi", "Pf",
"Sm", "Sc", "Sk", "So",
"Zs", "Zl", "Zp"
];
// The Rx
$GCRx = null; // filled in by makeGCRx()
// One-time make function
function makeGCRx()
{
global $General_Cat_Props, $GCRx ;
$rxstr = "(?=.)"; // Start of regex, something must be ahead
for ($i = 1; $i < sizeof( $General_Cat_Props ); $i++) {
$rxstr .= "(?=(\\p{" . $General_Cat_Props[ $i ] . "}))?";
}
$GCRx = "/$rxstr/su";
}
makeGCRx();
// print_r($GCRx . "\n");
function Get_UniCategories_From_Char( $char )
{
global $General_Cat_Props, $GCRx;
$ret = "";
if ( preg_match( $GCRx, $char, $matches )) {
$last_grp_matched = sizeof($matches); // trailing unmatched groups are absent from $matches
for ($i = 1; $i < $last_grp_matched; $i++) {
if ( $matches[ $i ] !== "" ) { // unmatched groups before the last match are empty strings
$ret .= $General_Cat_Props[ $i ] . " ";
}
}
}
return $ret;
}
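function Get_UniCategories_From_String( $str )
{
$ret = "";
// Split on UTF-8 character boundaries so multibyte characters stay intact
$chars = preg_split( '//u', $str, -1, PREG_SPLIT_NO_EMPTY ) ?: [];
foreach ( $chars as $ch ) {
$ret .= $ch . " " . Get_UniCategories_From_Char( $ch ) . "\n";
}
return $ret;
}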
print_r( "- " . Get_UniCategories_From_Char( "-" ) . "\n--------\n" );
// or
print_r( Get_UniCategories_From_String( "Hello 270 -,+?" ) . "\n" );
Output:
- Pd
--------
H Lu
e Ll
l Ll
l Ll
o Ll
Zs
2
7
0
Zs
- Pd
, Po
+ Sm
? Po
Upvotes: 1
Reputation: 89557
A possible way is to use IntlChar::charType (from the intl extension). Unfortunately, this method returns only an int, but this int is one of the constants defined in the IntlChar class. All the constants for the 30 categories are in the range 0 to 29 (no gaps). So all you have to do is build an indexed array that follows the same order:
$shortCats = [
'Cn', 'Lu', 'Ll', 'Lt', 'Lm', 'Lo',
'Mn', 'Me', 'Mc', 'Nd', 'Nl', 'No',
'Zs', 'Zl', 'Zp', 'Cc', 'Cf', 'Co',
'Cs', 'Pd', 'Ps', 'Pe', 'Pc', 'Po',
'Sm', 'Sc', 'Sk', 'So', 'Pi', 'Pf'
];
echo $shortCats[IntlChar::charType('-')]; //Pd
Note: If you are worried that the numeric values defined in the class might change in the future and want to be more rigorous, you can also write the array this way:
$shortCats = [
IntlChar::CHAR_CATEGORY_UNASSIGNED => 'Cn',
IntlChar::CHAR_CATEGORY_UPPERCASE_LETTER => 'Lu',
IntlChar::CHAR_CATEGORY_LOWERCASE_LETTER => 'Ll',
IntlChar::CHAR_CATEGORY_TITLECASE_LETTER => 'Lt',
// etc.
];
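For reference, the constant-keyed array can be written out in full and wrapped in a small helper. This is a sketch, not a definitive implementation: the function name unicode_category is hypothetical, and it assumes the intl extension is loaded and PHP 7.1 or later (for the nullable return type).

```php
<?php
// Hypothetical helper: maps the int returned by IntlChar::charType()
// to the two-letter general category name.
function unicode_category(string $char): ?string
{
    static $names = [
        IntlChar::CHAR_CATEGORY_UNASSIGNED            => 'Cn',
        IntlChar::CHAR_CATEGORY_UPPERCASE_LETTER      => 'Lu',
        IntlChar::CHAR_CATEGORY_LOWERCASE_LETTER      => 'Ll',
        IntlChar::CHAR_CATEGORY_TITLECASE_LETTER      => 'Lt',
        IntlChar::CHAR_CATEGORY_MODIFIER_LETTER       => 'Lm',
        IntlChar::CHAR_CATEGORY_OTHER_LETTER          => 'Lo',
        IntlChar::CHAR_CATEGORY_NON_SPACING_MARK      => 'Mn',
        IntlChar::CHAR_CATEGORY_ENCLOSING_MARK        => 'Me',
        IntlChar::CHAR_CATEGORY_COMBINING_SPACING_MARK => 'Mc',
        IntlChar::CHAR_CATEGORY_DECIMAL_DIGIT_NUMBER  => 'Nd',
        IntlChar::CHAR_CATEGORY_LETTER_NUMBER         => 'Nl',
        IntlChar::CHAR_CATEGORY_OTHER_NUMBER          => 'No',
        IntlChar::CHAR_CATEGORY_SPACE_SEPARATOR       => 'Zs',
        IntlChar::CHAR_CATEGORY_LINE_SEPARATOR        => 'Zl',
        IntlChar::CHAR_CATEGORY_PARAGRAPH_SEPARATOR   => 'Zp',
        IntlChar::CHAR_CATEGORY_CONTROL_CHAR          => 'Cc',
        IntlChar::CHAR_CATEGORY_FORMAT_CHAR           => 'Cf',
        IntlChar::CHAR_CATEGORY_PRIVATE_USE_CHAR      => 'Co',
        IntlChar::CHAR_CATEGORY_SURROGATE             => 'Cs',
        IntlChar::CHAR_CATEGORY_DASH_PUNCTUATION      => 'Pd',
        IntlChar::CHAR_CATEGORY_START_PUNCTUATION     => 'Ps',
        IntlChar::CHAR_CATEGORY_END_PUNCTUATION       => 'Pe',
        IntlChar::CHAR_CATEGORY_CONNECTOR_PUNCTUATION => 'Pc',
        IntlChar::CHAR_CATEGORY_OTHER_PUNCTUATION     => 'Po',
        IntlChar::CHAR_CATEGORY_MATH_SYMBOL           => 'Sm',
        IntlChar::CHAR_CATEGORY_CURRENCY_SYMBOL       => 'Sc',
        IntlChar::CHAR_CATEGORY_MODIFIER_SYMBOL       => 'Sk',
        IntlChar::CHAR_CATEGORY_OTHER_SYMBOL          => 'So',
        IntlChar::CHAR_CATEGORY_INITIAL_PUNCTUATION   => 'Pi',
        IntlChar::CHAR_CATEGORY_FINAL_PUNCTUATION     => 'Pf',
    ];

    // charType() expects a single code point (int or one UTF-8 character).
    $type = IntlChar::charType($char);

    return $type === null ? null : ($names[$type] ?? null);
}

echo unicode_category('-'), "\n"; // Pd
```

Keying the array by constant name rather than by position means the lookup stays correct even if the underlying integer values were ever renumbered.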
Upvotes: 3
Reputation: 192
Apparently there is no single built-in function that returns the category name directly, so I wrote this one:
<?php
// The 30 two-letter general categories that \p{..} accepts
$UNICODE_CATEGORIES = [
"Lu", "Ll", "Lt", "Lm", "Lo",
"Mn", "Mc", "Me",
"Nd", "Nl", "No",
"Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
"Sm", "Sc", "Sk", "So",
"Zs", "Zl", "Zp",
"Cc", "Cf", "Cs", "Co", "Cn"
];
function uni_category($char, $UNICODE_CATEGORIES) {
foreach ($UNICODE_CATEGORIES as $category) {
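// The u modifier makes PCRE treat $char as UTF-8, so multibyte characters are matched as whole characters
if (preg_match('/\p{'.$category.'}/u', $char))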
return $category;
}
return null;
}
// call the function
print uni_category('-', $UNICODE_CATEGORIES); // it returns Pd
This code works for me. I hope it helps somebody in the future :).
Upvotes: 2