Reputation: 535
I'm trying to transliterate text in PHP: I need to convert all non-Latin characters to Latin but keep the Italian accented characters (àèìòù).
PHP's Transliterator lacks documentation and online examples.
I've read the ICU docs and I know there is a rule syntax that forces Transliterator to convert one character into another that we specify (à > b).
The code (using the create function)
$str = "AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像";
$transliterator = Transliterator::create("Any-Latin; Latin-ASCII");
echo $transliterator->transliterate($str);
converts all non-Latin characters into Latin, but also strips the accents from every accented character, and gives the result
ASAaeiou Chen Hai yao Munchen Faisst Finis guo nei - jing xiang
and the code (using the createFromRules function)
$str = "AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像";
$transliterator = Transliterator::createFromRules("à > b");
echo $transliterator->transliterate($str);
correctly forces the conversion of à into b but, obviously, without the Any-Latin; Latin-ASCII conversion made by the previous code, giving the result
AŠAbèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像
So my goal is to merge the Any-Latin; Latin-ASCII conversion with the à > à rule (and the same for the other Italian accented vowels), in order to tell Transliterator to convert all non-Latin characters to Latin but map the Italian accented vowels to themselves, with the following result:
ASAàèìòù Chen Hai yao Munchen Faisst Finis guo nei - jing xiang
Is there a way to put the à > à rule in the create function's parameter, or to add the Any-Latin; Latin-ASCII directive to the createFromRules function's parameter?
Upvotes: 3
Views: 1085
Reputation: 89557
[EDIT] Simpler: use a filter to apply the changes only to selected characters:
$str = 'AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像';
$rules = <<<'RULES'
:: [^ÀàÈèÌìÒòÙù];
:: Any-Latin ;
:: Latin-ASCII ;
RULES;
$tls = Transliterator::createFromRules($rules);
echo $tls->transliterate($str), PHP_EOL;
// ASAàèìòù Chen Hai yao Munchen Faisst Finis guo nei - jing xiang
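:: [^ÀàÈèÌìÒòÙù] is the filter that excludes the selected accented letters: only characters matching the set (that is, everything else) are passed to the transforms that follow.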
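If the set of characters to keep varies, the same filter idea can be wrapped in a small function. This is only a sketch: transliterateExcept is a hypothetical helper name (not part of the intl extension), and it assumes $keep contains plain letters that need no escaping inside an ICU set. It also composes the input first, since a decomposed à (a followed by a combining grave) would not match the filter.

```php
<?php
// Hypothetical helper: transliterate everything except the characters in $keep.
function transliterateExcept(string $text, string $keep): string
{
    // Compose the input so that "a" + U+0300 becomes the single code point à.
    $text = Normalizer::normalize($text, Normalizer::FORM_C);
    if ($text === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    // Global filter first, then the two transforms from the question.
    // Assumption: $keep holds only plain letters (no ICU set metacharacters).
    $rules = ":: [^{$keep}] ;\n:: Any-Latin ;\n:: Latin-ASCII ;\n";

    $transliterator = Transliterator::createFromRules($rules);
    if ($transliterator === null) {
        throw new RuntimeException(intl_get_error_message());
    }

    $result = $transliterator->transliterate($text);
    if ($result === false) {
        throw new RuntimeException((string) $transliterator->getErrorMessage());
    }

    return $result;
}

$out = transliterateExcept('München àèìòù Финиш', 'ÀàÈèÌìÒòÙù');
echo $out, PHP_EOL;
```

Building the rule string once and reusing the Transliterator object would be cheaper if many strings are processed; the function above recreates it on every call only to keep the sketch short.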
[OLD ANSWER] (that works too)
You can play with the normalization to protect accented characters you want to preserve before the transliteration from Any to Latin:
$str = 'AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像';
$rules = <<<'RULES'
:: NFC ;
à > a ̀ ;
è > e ̀ ;
ì > i ̀ ;
ò > o ̀ ;
ù > u ̀ ;
:: Any-Latin ;
:: [^ ̀ ]-ASCII ;
:: NFC ;
RULES;
$tls = Transliterator::createFromRules($rules);
echo $tls->transliterate($str), PHP_EOL;
// ASAàèìòù Chen Hai yao Munchen Faisst Finis guo nei - jing xiang
This way, the accented characters you want to protect (those with a grave accent) are the only ones in decomposed form (a base letter plus a combining character). Those coming from the Any-Latin transliteration are in composed form (they use a single code point).
Then, instead of Latin in Latin-ASCII, you can use a set that excludes the combining grave accent.
Upvotes: 2
Reputation: 197757
Given your example with input and output:
$transliterator = Transliterator::create("Any-Latin; Latin-ASCII");
$str = "AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像";
echo $transliterator->transliterate($str), "\n";
ASAaeiou Chen Hai yao Munchen Faisst Finis guo nei - jing xiang
Applying the transliteration only to the segments that do not match the characters you want to keep (the Italian accented characters [àèìòù]) should give the desired result.
One option is to use preg_replace_callback for that.
It requires a callback that applies the transliteration:
$transliterate = static function (array $match) use ($transliterator) {
return $transliterator->transliterate($match[0]);
};
It also requires a pattern that matches everything but the characters to keep. The pattern needs to be properly defined and compatible with Unicode:
([^\xE0\xE8\xEC\xF2\xF9]+)ui
(...)                 : delimiters: the regular expression is inside
u                     : modifier: Unicode mode (UTF-8 encoding in PHP, PCRE_UTF8)
i                     : modifier: letters in the pattern match both upper- and lower-case letters (PCRE_CASELESS)
[^...]                : negated character class (`^`): matches any character not listed
\xE0\xE8\xEC\xF2\xF9  : the Italian accented characters àèìòù written in a stable notation (easy to copy and paste, for example)
Last but not least, the subject to operate on must be compatible with the characters to keep. As there can be many ways to write the same character in Unicode, the input is normalized to be compatible with the PCRE pattern:
echo preg_replace_callback(
'([^\xE0\xE8\xEC\xF2\xF9]+)ui',
$transliterate,
Normalizer::normalize($str, Normalizer::NFC)
), "\n";
The output:
ASAàèìòù Chen Hai yao Munchen Faisst Finis guo nei - jing xiang
Addendum:
\xE0\xE1\xE8\xE9\xEC\xED\xF2\xF3\xF9\xFA
    lower-case list of Italian accented characters (can be used with the i modifier)
\xC0\xC1\xC8\xC9\xCC\xCD\xD2\xD3\xD9\xDA\xE0\xE1\xE8\xE9\xEC\xED\xF2\xF3\xF9\xFA
    lower- and upper-case list of Italian accented characters (can be used without the i modifier)
\xhh       character with hex code hh
\x{hhh..}  character with hex code hhh..
Upvotes: 1
Reputation: 1357
A method I have used when trying to fend off unwanted transliteration - it's a little ugly, but works with fairly little effort. Substitute the characters you DON'T want to transliterate with tags, and then replace them after transliteration:
<?php
$str = "AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像";
$str = str_replace(['à', 'è', 'ì', 'ò', 'ù'], ['@@a@@', '@@e@@', '@@i@@', '@@o@@', '@@u@@'], $str);
$transliterator = Transliterator::create("Any-Latin; Latin-ASCII");
$out = $transliterator->transliterate($str);
$out = str_replace(['@@a@@', '@@e@@', '@@i@@', '@@o@@', '@@u@@'], ['à', 'è', 'ì', 'ò', 'ù'], $out);
echo $out;
The result is:
ASAàèìòù Chen Hai yao Munchen Faisst Finis guo nei - jing xiang
Upvotes: 0
Reputation: 3418
You can use preg_replace_callback to match every run of characters other than the Italian accented ones and apply the transliteration to each match.
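A minimal sketch of that idea, assuming the intl extension is available. The function name transliterateKeepingItalian is hypothetical, and the list of code points covers only the grave-accented vowels from the question in both cases.

```php
<?php
$transliterator = Transliterator::create('Any-Latin; Latin-ASCII');

// Hypothetical helper: transliterates every run of characters that is not
// one of the accented vowels to keep, and leaves those vowels untouched.
function transliterateKeepingItalian(Transliterator $t, string $text): string
{
    // Compose first so that "a" + combining grave becomes the single code point à;
    // otherwise the character class below would not recognise it.
    $text = Normalizer::normalize($text, Normalizer::FORM_C);
    if ($text === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    // à è ì ò ù and À È Ì Ò Ù, written as code points so the pattern does not
    // depend on how this source file is encoded.
    $pattern = '/[^\x{E0}\x{E8}\x{EC}\x{F2}\x{F9}\x{C0}\x{C8}\x{CC}\x{D2}\x{D9}]+/u';

    return preg_replace_callback(
        $pattern,
        static function (array $match) use ($t): string {
            return (string) $t->transliterate($match[0]);
        },
        $text
    );
}

$out = transliterateKeepingItalian($transliterator, 'München àèìòù Финиш');
echo $out, PHP_EOL;
```

Note that each run is transliterated on its own, so a transform whose rules depend on neighbouring characters could in principle behave differently than it would on the whole string.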
Upvotes: 0
Reputation: 32232
All you have to do is remove the Latin-ASCII rule.
$str = "AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像";
$transliterator = Transliterator::create("Any-Latin; Any-NFC");
echo $transliterator->transliterate($str);
Output:
AŠAàèìòù Chén Hǎi yáo München Faißt Finiš guó nèi - jìng xiàng
You might also want to take the opportunity to apply a normalization rule to the string to compose or decompose the accented characters into a consistent form, depending on what you plan to do with them.
$str = "AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像";
$none = Transliterator::create("Any-Latin");
$nfc = Transliterator::create("Any-Latin; Any-NFC");
$nfd = Transliterator::create("Any-Latin; Any-NFD");
var_dump(
$none->transliterate($str),
$nfc->transliterate($str),
$nfd->transliterate($str)
);
Output:
string(78) "AŠAàèìòù Chén Hǎi yáo München Faißt Finiš guó nèi - jìng xiàng"
string(78) "AŠAàèìòù Chén Hǎi yáo München Faißt Finiš guó nèi - jìng xiàng"
string(93) "AŠAàèìòù Chén Hǎi yáo München Faißt Finiš guó nèi - jìng xiàng"
NFC is "composed", as in all accented characters that have a single-codepoint representation are represented as such. NFD is "decomposed" and all accented characters are split into their base codepoint and an accent combining mark. In both cases, multiple combining marks on a single base character will be arranged in a consistent manner.
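The difference between the two forms can be seen on a single character. The snippet below is just an illustration using the Normalizer class from the intl extension; byte lengths are for UTF-8.

```php
<?php
$composed   = "\u{00E0}"; // à as one code point (U+00E0)
$decomposed = Normalizer::normalize($composed, Normalizer::FORM_D);

var_dump(strlen($composed));            // int(2): U+00E0 takes 2 bytes in UTF-8
var_dump(strlen($decomposed));          // int(3): "a" (1 byte) + U+0300 (2 bytes)
var_dump($decomposed === "a\u{0300}");  // bool(true)

// Composing again gives back the original single code point.
$recomposed = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($recomposed === $composed);    // bool(true)
```

Both strings render as the same glyph, which is why comparing or filtering text without normalizing it first can give surprising results.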
Some filesystems require a certain form (e.g. macOS's HFS+ stores file names in a decomposed, NFD-like form), while others (e.g. ext) simply accept anything, which can create "duplicate" files with mixed composition that are tricky to deal with.
Upvotes: 0