Reputation: 535

Exclude specific characters from Transliterator conversion

I'm trying to transliterate text in PHP. I need to convert all non-Latin characters while keeping the Italian accented characters (àèìòù).

PHP's Transliterator lacks documentation and online examples. I've read the ICU docs, and I know there is a rule syntax that forces Transliterator to convert one character into another that we specify (à > b).

The code (using the create function)

$str = "AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像";
$transliterator = Transliterator::create("Any-Latin; Latin-ASCII");
echo $transliterator->transliterate($str);

converts all non-Latin characters into Latin (including all the accented characters) and gives the result

ASAaeiou Chen Hai yao Munchen Faisst Finis guo nei - jing xiang

and the code (using the createFromRules function)

$str = "AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像";
$transliterator = Transliterator::createFromRules("á>b");
echo $transliterator->transliterate($str);

correctly forces the conversion of à into b but, obviously, without the Any-Latin; Latin-ASCII conversion made by the previous code, giving the result

AŠAbèìòù Chén Hǎi ybo München Faißt Финиш 国内 - 镜像

So my goal is to merge the Any-Latin; Latin-ASCII conversion with an à > à rule (and the same for the other Italian accented vowels), so that Transliterator converts all non-Latin characters to Latin but maps the Italian accented vowels to themselves, giving the following result:

ASAàèìòù Chen Hai yao Munchen Faisst Finis guo nei - jing xiang

Is there a way to put the à>à rule in the create function's parameter, or to add the Any-Latin; Latin-ASCII directive to the createFromRules function's parameter?

Upvotes: 3

Answers (5)

Casimir et Hippolyte

Reputation: 89557

[EDIT] Simpler: use a filter so that the changes apply only to selected characters:

$str = 'AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像';

$rules = <<<'RULES'
:: [^ÀàÈèÌìÒòÙù];
:: Any-Latin ;
:: Latin-ASCII ;
RULES;

$tls = Transliterator::createFromRules($rules);

echo $tls->transliterate($str), PHP_EOL;
// ASAàèìòù Chen Hai yao Munchen Faisst Finis guo nei - jing xiang

:: [^ÀàÈèÌìÒòÙù] is a global filter: only characters matching the set are processed by the rules that follow, so the listed accented letters are left untouched.
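If the set of characters to keep varies, the filter line can be generated instead of hard-coded. The following is a minimal sketch under some assumptions: `createKeepingTransliterator` is a hypothetical helper name (not part of the intl extension), and the characters passed in are assumed to be plain letters that need no escaping inside an ICU set.

```php
<?php
// Hypothetical helper: builds the same three-line rule set as above,
// with the filter generated from the characters to leave untouched.
// Assumes $keep holds only letters (no ICU set metacharacters such as ] ^ - \).
function createKeepingTransliterator(string $keep): ?Transliterator
{
    $rules = ":: [^{$keep}];\n"    // global filter: skip the listed characters
           . ":: Any-Latin;\n"     // convert other scripts to Latin
           . ":: Latin-ASCII;\n";  // strip the remaining diacritics
    return Transliterator::createFromRules($rules);
}

$tls = createKeepingTransliterator('ÀàÈèÌìÒòÙù');
$result = $tls->transliterate('AŠAàèìòù München Faißt');
echo $result, PHP_EOL;
```

The helper returns null when ICU rejects the rules, so a real caller would probably check the return value before calling transliterate().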

[OLD ANSWER] (this works too)
You can use normalization to protect the accented characters you want to preserve before the Any-Latin transliteration:

$str = 'AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像';

$rules = <<<'RULES'
:: NFC ;
à > a ̀  ;
è > e ̀  ;
ì > i ̀  ;
ò > o ̀  ;
ù > u ̀  ;
:: Any-Latin   ;
:: [^ ̀ ]-ASCII ;
:: NFC ;
RULES;

$tls = Transliterator::createFromRules($rules);

echo $tls->transliterate($str), PHP_EOL;
// ASAàèìòù Chen Hai yao Munchen Faisst Finis guo nei - jing xiang

This way, the accented characters you want to protect (those with a grave accent) are the only ones in decomposed form (a base letter followed by a combining character). Those coming from the Any-Latin transliteration are in composed form (a single code point). Then, instead of Latin in Latin-ASCII, you can use a set that excludes the combining grave accent, so the protected letters are skipped and the final NFC step recomposes them.

Upvotes: 2

hakre

Reputation: 197757

Given your example with input and output:

$transliterator = Transliterator::create("Any-Latin; Latin-ASCII");
$str = "AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像";
echo $transliterator->transliterate($str), "\n";

ASAaeiou Chen Hai yao Munchen Faisst Finis guo nei - jing xiang

Applying the transliteration only to the segments that contain none of the characters you want to keep (the Italian accented characters [àèìòù]) should give the desired result.

One option is to use preg_replace_callback for that.

It requires a callback that applies the transliteration:

$transliterate = static function (array $match) use ($transliterator) {
    return $transliterator->transliterate($match[0]);
};

And it requires a pattern that matches everything but the characters to keep. It needs to be properly defined and compatible with Unicode:

([^\xE0\xE8\xEC\xF2\xF9]+)ui


(...)                : delimiters: the regular expression is inside
u                    : modifier: u - Unicode mode (UTF-8 encoding in
                       PHP, PCRE_UTF8)
i                    : modifier: i - letters in the pattern match
                       both upper and lower case letters
                       (PCRE_CASELESS)

[^...]               : character class: not matching any of the
                       characters (`^`); negated character class
\xE0\xE8\xEC\xF2\xF9 : the italian accented characters àèìòù written
                       in a stable notation (you can easily copy and
                       paste it for example)
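To see what the pattern actually selects, it can be run on its own, without any transliteration. This is a small sketch with a made-up sample string; the input is written with \u{...} escapes (PHP 7.0+) so the code points are unambiguous.

```php
<?php
// Sample input: "caffè più À", spelled with explicit code points.
$subject = "caff\u{E8} pi\u{F9} \u{C0}";

// Same pattern as above: runs of characters that are NOT àèìòù.
// Because of the i modifier, the upper-case À is excluded as well.
preg_match_all('([^\xE0\xE8\xEC\xF2\xF9]+)ui', $subject, $matches);

// Each element is one segment that would be handed to the callback:
// "caff", " pi" and the single space before À.
var_dump($matches[0]);
```

Everything outside these segments (è, ù and À here) is left as it is by preg_replace_callback.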

Last but not least, the subject to operate on must be compatible with the characters to keep. As there can be many ways to write the same character in Unicode, the input is normalized to be compatible with the PCRE pattern:

echo preg_replace_callback(
    '([^\xE0\xE8\xEC\xF2\xF9]+)ui', 
    $transliterate, 
    Normalizer::normalize($str, Normalizer::NFC)
), "\n";

The output:

ASAàèìòù Chen Hai yao Munchen Faisst Finis guo nei - jing xiang
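The normalization step matters because the pattern lists precomposed code points only. A short sketch of the difference, using è in its composed and decomposed spellings (the variable names are just for illustration):

```php
<?php
$composed   = "\u{E8}";    // è as a single code point (NFC)
$decomposed = "e\u{300}";  // e followed by a combining grave accent (NFD)

$pattern = '(\xE8)u';

// The composed form matches; the decomposed form does not,
// unless it is normalized to NFC first.
$matchComposed   = preg_match($pattern, $composed);
$matchDecomposed = preg_match($pattern, $decomposed);
$matchNormalized = preg_match(
    $pattern,
    Normalizer::normalize($decomposed, Normalizer::NFC)
);

var_dump($matchComposed, $matchDecomposed, $matchNormalized); // int(1) int(0) int(1)
```

Without the Normalizer::normalize() call, a decomposed è in the input would land in a transliterated segment and lose its accent.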


Addendum:

\xE0\xE1\xE8\xE9\xEC\xED\xF2\xF3\xF9\xFA lower-case list of Italian accented characters, grave and acute (can be used with the i modifier)
\xC0\xC1\xC8\xC9\xCC\xCD\xD2\xD3\xD9\xDA\xE0\xE1\xE8\xE9\xEC\xED\xF2\xF3\xF9\xFA lower- and upper-case list of Italian accented characters, grave and acute (can be used without the i modifier)

PCRE Syntax CHARACTERS (excerpt):

   \xhh       character with hex code hh
   \x{hhh..}  character with hex code hhh..

Link to the full PCRE syntax: https://www.pcre.org/original/doc/html/pcresyntax.html

Upvotes: 1

wordragon

Reputation: 1357

A method I have used when trying to fend off unwanted transliteration - it's a little ugly, but works with fairly little effort. Substitute the characters you DON'T want to transliterate with tags, and then replace them after transliteration:

<?php

$str = "AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像";
$str = str_replace(['à', 'è', 'ì', 'ò', 'ù'], ['@@a@@', '@@e@@', '@@i@@', '@@o@@', '@@u@@'], $str);
$transliterator = Transliterator::create("Any-Latin; Latin-ASCII");
$out = $transliterator->transliterate($str);
$out = str_replace(['@@a@@', '@@e@@', '@@i@@', '@@o@@', '@@u@@'], ['à', 'è', 'ì', 'ò', 'ù'], $out);
echo $out;

The result is:

ASAàèìòù Chen Hai yao Munchen Faisst Finis guo nei - jing xiang

Upvotes: 0

mvorisek

Reputation: 3418

You can use preg_replace_callback to match every run of characters other than the Italian accented ones and apply the transliteration to each match.
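A minimal sketch of that idea, assuming PHP 7.4+ (for the arrow function) and the intl extension. `transliterateExcept` is a hypothetical helper name, and the characters to keep are assumed to be given in composed (NFC) form.

```php
<?php
// Hypothetical helper illustrating the idea; not part of the intl extension.
function transliterateExcept(string $text, string $keep, Transliterator $t): string
{
    // Normalize to NFC so that each kept character is a single code point.
    $text = Normalizer::normalize($text, Normalizer::NFC);

    // Negated character class built from the characters to keep;
    // preg_quote escapes regex metacharacters and the delimiter.
    $pattern = '/[^' . preg_quote($keep, '/') . ']+/u';

    // Transliterate only the runs that contain none of the kept characters.
    return preg_replace_callback(
        $pattern,
        static fn(array $m): string => $t->transliterate($m[0]),
        $text
    );
}

$t = Transliterator::create('Any-Latin; Latin-ASCII');
$result = transliterateExcept('àèìòù München', 'àèìòù', $t);
echo $result, PHP_EOL;
```

Without the i modifier, only the exact characters listed in $keep are protected, so upper-case variants have to be listed explicitly if they should be kept too.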

Upvotes: 0

Sammitch

Reputation: 32232

All you have to do is remove the Latin-ASCII rule.

$str = "AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像";
$transliterator = Transliterator::create("Any-Latin; Any-NFC");
echo $transliterator->transliterate($str);

Output:

AŠAàèìòù Chén Hǎi yáo München Faißt Finiš guó nèi - jìng xiàng

You might also want to take the opportunity to apply a normalization rule to the string to compose or decompose the accented characters into a consistent form, depending on what you plan to do with them.

$str = "AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像";
$none = Transliterator::create("Any-Latin");
$nfc = Transliterator::create("Any-Latin; Any-NFC");
$nfd = Transliterator::create("Any-Latin; Any-NFD");
var_dump(
    $none->transliterate($str),
    $nfc->transliterate($str),
    $nfd->transliterate($str)
);

Output:

string(78) "AŠAàèìòù Chén Hǎi yáo München Faißt Finiš guó nèi - jìng xiàng"
string(78) "AŠAàèìòù Chén Hǎi yáo München Faißt Finiš guó nèi - jìng xiàng"
string(93) "AŠAàèìòù Chén Hǎi yáo München Faißt Finiš guó nèi - jìng xiàng"

NFC is "composed", as in all accented characters that have a single-codepoint representation are represented as such. NFD is "decomposed" and all accented characters are split into their base codepoint and an accent combining mark. In both cases, multiple combining marks on a single base character will be arranged in a consistent manner.
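The difference between the two forms is easy to inspect at the byte level. A small sketch using Normalizer on a single à (the variable names are illustrative only):

```php
<?php
$nfc = Normalizer::normalize("\u{E0}", Normalizer::NFC); // à composed
$nfd = Normalizer::normalize("\u{E0}", Normalizer::NFD); // à decomposed

// One code point (2 bytes in UTF-8) versus base letter plus combining mark (3 bytes).
var_dump(bin2hex($nfc)); // string(4) "c3a0"
var_dump(bin2hex($nfd)); // string(6) "61cc80"

// The two strings render identically but are not equal byte for byte.
var_dump($nfc === $nfd);                                   // bool(false)
var_dump(Normalizer::isNormalized($nfd, Normalizer::NFC)); // bool(false)
```

This is also why the byte lengths reported by var_dump differ between the NFC and NFD results above even though the text looks the same.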

Some filesystems require a certain form, e.g. HFS+ on macOS stores file names in a decomposed, NFD-like form, and some will simply accept anything, e.g. ext, creating "duplicate" files with mixed composition that are tricky to deal with.

Upvotes: 0
