Reputation: 30043
I'm attempting to remove accents from characters in PHP string as the first step to making the string usable in a URL.
I'm using the following code:
$input = "Fóø Bår";
setlocale(LC_ALL, "en_US.utf8");
$output = iconv("utf-8", "ascii//TRANSLIT", $input);
print($output);
The output I would expect would be something like this:
F'oo Bar
However, instead of the accented characters being transliterated they are replaced with question marks:
F?? B?r
Everything I can find online indicates that setting the locale will fix this problem, however I'm already doing this. I've already checked the following details:
locale -a
)iconv -l
)mb_check_encoding
function, as suggested in the answer by mercator)setlocale
is successful (it returns 'en_US.utf8'
rather than FALSE
)The server is using the wrong implementation of iconv. It has the glibc version instead of the required libiconv version.
Note that the iconv function on some systems may not work as you expect. In such case, it'd be a good idea to install the GNU libiconv library. It will most likely end up with more consistent results.
– PHP manual's introduction to iconv
Details about the iconv implementation that is used by PHP are included in the output of the phpinfo
function.
(I'm not able to re-compile PHP with the correct iconv library on the server I'm working with for this project so the answer I've accepted below is the one that was most useful for removing accents without iconv support.)
Upvotes: 115
Views: 226975
Reputation: 11760
All of these are wrong. https://stackoverflow.com/a/35177899/308851 gets close but it gets Latin involved and gives no source for the rule either.
Let's check the standard... the library provided by the Unicode Consortium is ICU and the documentation has this to say:
For example, to remove accents from characters, use the following transform:
NFD; [:Nonspacing Mark:] Remove; NFC.
This transform separates accents from their base characters, removes the accents, and then puts the remaining text into an unaccented form.
That's it. No need for anything else.
Thus, if you have intl
installed then you can do
$transliterator = Transliterator::createFromRules(':: NFD; :: [:Mn:] Remove; :: NFC;');
echo $transliterator->transliterate($string);
That's it. That's the answer to the question.
If you need to do this somewhere where intl
is not available, you can snapshot what it does on a machine which does have intl
:
<?php
$transliterator = \Transliterator::createFromRules(':: NFD; :: [:Mn:] Remove; :: NFC;');
$letters = preg_grep('/\pL/u', array_map('utf8', range(0x80, 0x2000)));
$letters = array_combine($letters, $letters);
$transliterated = array_map([$transliterator, 'transliterate'], $letters);
$map = array_diff_assoc($transliterated, $letters);
print count($map);
$search = [' => ', 'array (', ')', ' ', "\n"];
$replace = ['=>', '[', ']', '', ''];
$map = str_replace($search, $replace, var_export($map, TRUE));
file_put_contents("map.php", "<?php\n\$map = $map;");
function utf8($num)
{
if($num<=0x7F) return chr($num);
if($num<=0x7FF) return chr(($num>>6)+192).chr(($num&63)+128);
if($num<=0xFFFF) return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
if($num<=0x1FFFFF) return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128).chr(($num&63)+128);
return '';
}
?>
Here I have snapshotted the first 8192 characters which includes most non-Hiragana/Katakana scripts. The results are 9146 bytes long (Hiragana/Katakana more than doubles that and I didn't need it) encoding 827 replacements. You can put in your codebase and use it like this:
function removeAccents($string) {
include 'map.php';
return strtr($string, $map);
}
(this is demo, rather inline map.php into a class constant etc)
This is similar to https://stackoverflow.com/a/29280197/308851 but it includes a much larger number of characters -- and shows you how to obtain this mapping.
Ps.: Transliterating to ASCII is an entirely different bag of hurt. This is answer to "How do I remove accents from characters in a PHP string?"
Pps.: Here's the snapshot:
$map = ['À'=>'A','Á'=>'A','Â'=>'A','Ã'=>'A','Ä'=>'A','Å'=>'A','Ç'=>'C','È'=>'E','É'=>'E','Ê'=>'E','Ë'=>'E','Ì'=>'I','Í'=>'I','Î'=>'I','Ï'=>'I','Ñ'=>'N','Ò'=>'O','Ó'=>'O','Ô'=>'O','Õ'=>'O','Ö'=>'O','Ù'=>'U','Ú'=>'U','Û'=>'U','Ü'=>'U','Ý'=>'Y','à'=>'a','á'=>'a','â'=>'a','ã'=>'a','ä'=>'a','å'=>'a','ç'=>'c','è'=>'e','é'=>'e','ê'=>'e','ë'=>'e','ì'=>'i','í'=>'i','î'=>'i','ï'=>'i','ñ'=>'n','ò'=>'o','ó'=>'o','ô'=>'o','õ'=>'o','ö'=>'o','ù'=>'u','ú'=>'u','û'=>'u','ü'=>'u','ý'=>'y','ÿ'=>'y','Ā'=>'A','ā'=>'a','Ă'=>'A','ă'=>'a','Ą'=>'A','ą'=>'a','Ć'=>'C','ć'=>'c','Ĉ'=>'C','ĉ'=>'c','Ċ'=>'C','ċ'=>'c','Č'=>'C','č'=>'c','Ď'=>'D','ď'=>'d','Ē'=>'E','ē'=>'e','Ĕ'=>'E','ĕ'=>'e','Ė'=>'E','ė'=>'e','Ę'=>'E','ę'=>'e','Ě'=>'E','ě'=>'e','Ĝ'=>'G','ĝ'=>'g','Ğ'=>'G','ğ'=>'g','Ġ'=>'G','ġ'=>'g','Ģ'=>'G','ģ'=>'g','Ĥ'=>'H','ĥ'=>'h','Ĩ'=>'I','ĩ'=>'i','Ī'=>'I','ī'=>'i','Ĭ'=>'I','ĭ'=>'i','Į'=>'I','į'=>'i','İ'=>'I','Ĵ'=>'J','ĵ'=>'j','Ķ'=>'K','ķ'=>'k','Ĺ'=>'L','ĺ'=>'l','Ļ'=>'L','ļ'=>'l','Ľ'=>'L','ľ'=>'l','Ń'=>'N','ń'=>'n','Ņ'=>'N','ņ'=>'n','Ň'=>'N','ň'=>'n','Ō'=>'O','ō'=>'o','Ŏ'=>'O','ŏ'=>'o','Ő'=>'O','ő'=>'o','Ŕ'=>'R','ŕ'=>'r','Ŗ'=>'R','ŗ'=>'r','Ř'=>'R','ř'=>'r','Ś'=>'S','ś'=>'s','Ŝ'=>'S','ŝ'=>'s','Ş'=>'S','ş'=>'s','Š'=>'S','š'=>'s','Ţ'=>'T','ţ'=>'t','Ť'=>'T','ť'=>'t','Ũ'=>'U','ũ'=>'u','Ū'=>'U','ū'=>'u','Ŭ'=>'U','ŭ'=>'u','Ů'=>'U','ů'=>'u','Ű'=>'U','ű'=>'u','Ų'=>'U','ų'=>'u','Ŵ'=>'W','ŵ'=>'w','Ŷ'=>'Y','ŷ'=>'y','Ÿ'=>'Y','Ź'=>'Z','ź'=>'z','Ż'=>'Z','ż'=>'z','Ž'=>'Z','ž'=>'z','Ơ'=>'O','ơ'=>'o','Ư'=>'U','ư'=>'u','Ǎ'=>'A','ǎ'=>'a','Ǐ'=>'I','ǐ'=>'i','Ǒ'=>'O','ǒ'=>'o','Ǔ'=>'U','ǔ'=>'u','Ǖ'=>'U','ǖ'=>'u','Ǘ'=>'U','ǘ'=>'u','Ǚ'=>'U','ǚ'=>'u','Ǜ'=>'U','ǜ'=>'u','Ǟ'=>'A','ǟ'=>'a','Ǡ'=>'A','ǡ'=>'a','Ǣ'=>'Æ','ǣ'=>'æ','Ǧ'=>'G','ǧ'=>'g','Ǩ'=>'K','ǩ'=>'k','Ǫ'=>'O','ǫ'=>'o','Ǭ'=>'O','ǭ'=>'o','Ǯ'=>'Ʒ','ǯ'=>'ʒ','ǰ'=>'j','Ǵ'=>'G','ǵ'=>'g','Ǹ'=>'N','ǹ'=>'n','Ǻ'=>'A','ǻ'=>'a','Ǽ'=>'Æ','ǽ'=>'æ','Ǿ'=>'Ø','ǿ'=>'ø','Ȁ'=>'A','ȁ'=>'a','Ȃ'=>'A','ȃ'=>'a','Ȅ'=>'E','ȅ'=>'e','Ȇ'=>'E','ȇ'=>'e','Ȉ'=>'I','ȉ'=>'i','Ȋ'=>'I','ȋ'=>'i','Ȍ'=>'O','ȍ'=>'o','Ȏ'=>'O','ȏ'=>'o','Ȑ'=>'R','ȑ'=>'r','Ȓ'=>'R','ȓ'=>'r','Ȕ'=>'U','ȕ'=>'u','Ȗ'=>'U','ȗ'=>'u','Ș'=>'S','ș'=>'s','Ț'=>'T','ț'=>'t','Ȟ'=>'H','ȟ'=>'h','Ȧ'=>'A','ȧ'=>'a','Ȩ'=>'E','ȩ'=>'e','Ȫ'=>'O','ȫ'=>'o','Ȭ'=>'O','ȭ'=>'o','Ȯ'=>'O','ȯ'=>'o','Ȱ'=>'O','ȱ'=>'o','Ȳ'=>'Y','ȳ'=>'y','ʹ'=>'ʹ','Ά'=>'Α','Έ'=>'Ε','Ή'=>'Η','Ί'=>'Ι','Ό'=>'Ο','Ύ'=>'Υ','Ώ'=>'Ω','ΐ'=>'ι','Ϊ'=>'Ι','Ϋ'=>'Υ','ά'=>'α','έ'=>'ε','ή'=>'η','ί'=>'ι','ΰ'=>'υ','ϊ'=>'ι','ϋ'=>'υ','ό'=>'ο','ύ'=>'υ','ώ'=>'ω','ϓ'=>'ϒ','ϔ'=>'ϒ','Ѐ'=>'Е','Ё'=>'Е','Ѓ'=>'Г','Ї'=>'І','Ќ'=>'К','Ѝ'=>'И','Ў'=>'У','Й'=>'И','й'=>'и','ѐ'=>'е','ё'=>'е','ѓ'=>'г','ї'=>'і','ќ'=>'к','ѝ'=>'и','ў'=>'у','Ѷ'=>'Ѵ','ѷ'=>'ѵ','Ӂ'=>'Ж','ӂ'=>'ж','Ӑ'=>'А','ӑ'=>'а','Ӓ'=>'А','ӓ'=>'а','Ӗ'=>'Е','ӗ'=>'е','Ӛ'=>'Ә','ӛ'=>'ә','Ӝ'=>'Ж','ӝ'=>'ж','Ӟ'=>'З','ӟ'=>'з','Ӣ'=>'И','ӣ'=>'и','Ӥ'=>'И','ӥ'=>'и','Ӧ'=>'О','ӧ'=>'о','Ӫ'=>'Ө','ӫ'=>'ө','Ӭ'=>'Э','ӭ'=>'э','Ӯ'=>'У','ӯ'=>'у','Ӱ'=>'У','ӱ'=>'у','Ӳ'=>'У','ӳ'=>'у','Ӵ'=>'Ч','ӵ'=>'ч','Ӹ'=>'Ы','ӹ'=>'ы','آ'=>'ا','أ'=>'ا','ؤ'=>'و','إ'=>'ا','ئ'=>'ي','ۀ'=>'ە','ۂ'=>'ہ','ۓ'=>'ے','ऩ'=>'न','ऱ'=>'र','ऴ'=>'ळ','क़'=>'क','ख़'=>'ख','ग़'=>'ग','ज़'=>'ज','ड़'=>'ड','ढ़'=>'ढ','फ़'=>'फ','य़'=>'य','ড়'=>'ড','ঢ়'=>'ঢ','য়'=>'য','ਲ਼'=>'ਲ','ਸ਼'=>'ਸ','ਖ਼'=>'ਖ','ਗ਼'=>'ਗ','ਜ਼'=>'ਜ','ਫ਼'=>'ਫ','ଡ଼'=>'ଡ','ଢ଼'=>'ଢ','གྷ'=>'ག','ཌྷ'=>'ཌ','དྷ'=>'ད','བྷ'=>'བ','ཛྷ'=>'ཛ','ཀྵ'=>'ཀ','ဦ'=>'ဥ','Ḁ'=>'A','ḁ'=>'a','Ḃ'=>'B','ḃ'=>'b','Ḅ'=>'B','ḅ'=>'b','Ḇ'=>'B','ḇ'=>'b','Ḉ'=>'C','ḉ'=>'c','Ḋ'=>'D','ḋ'=>'d','Ḍ'=>'D','ḍ'=>'d','Ḏ'=>'D','ḏ'=>'d','Ḑ'=>'D','ḑ'=>'d','Ḓ'=>'D','ḓ'=>'d','Ḕ'=>'E','ḕ'=>'e','Ḗ'=>'E','ḗ'=>'e','Ḙ'=>'E','ḙ'=>'e','Ḛ'=>'E','ḛ'=>'e','Ḝ'=>'E','ḝ'=>'e','Ḟ'=>'F','ḟ'=>'f','Ḡ'=>'G','ḡ'=>'g','Ḣ'=>'H','ḣ'=>'h','Ḥ'=>'H','ḥ'=>'h','Ḧ'=>'H','ḧ'=>'h','Ḩ'=>'H','ḩ'=>'h','Ḫ'=>'H','ḫ'=>'h','Ḭ'=>'I','ḭ'=>'i','Ḯ'=>'I','ḯ'=>'i','Ḱ'=>'K','ḱ'=>'k','Ḳ'=>'K','ḳ'=>'k','Ḵ'=>'K','ḵ'=>'k','Ḷ'=>'L','ḷ'=>'l','Ḹ'=>'L','ḹ'=>'l','Ḻ'=>'L','ḻ'=>'l','Ḽ'=>'L','ḽ'=>'l','Ḿ'=>'M','ḿ'=>'m','Ṁ'=>'M','ṁ'=>'m','Ṃ'=>'M','ṃ'=>'m','Ṅ'=>'N','ṅ'=>'n','Ṇ'=>'N','ṇ'=>'n','Ṉ'=>'N','ṉ'=>'n','Ṋ'=>'N','ṋ'=>'n','Ṍ'=>'O','ṍ'=>'o','Ṏ'=>'O','ṏ'=>'o','Ṑ'=>'O','ṑ'=>'o','Ṓ'=>'O','ṓ'=>'o','Ṕ'=>'P','ṕ'=>'p','Ṗ'=>'P','ṗ'=>'p','Ṙ'=>'R','ṙ'=>'r','Ṛ'=>'R','ṛ'=>'r','Ṝ'=>'R','ṝ'=>'r','Ṟ'=>'R','ṟ'=>'r','Ṡ'=>'S','ṡ'=>'s','Ṣ'=>'S','ṣ'=>'s','Ṥ'=>'S','ṥ'=>'s','Ṧ'=>'S','ṧ'=>'s','Ṩ'=>'S','ṩ'=>'s','Ṫ'=>'T','ṫ'=>'t','Ṭ'=>'T','ṭ'=>'t','Ṯ'=>'T','ṯ'=>'t','Ṱ'=>'T','ṱ'=>'t','Ṳ'=>'U','ṳ'=>'u','Ṵ'=>'U','ṵ'=>'u','Ṷ'=>'U','ṷ'=>'u','Ṹ'=>'U','ṹ'=>'u','Ṻ'=>'U','ṻ'=>'u','Ṽ'=>'V','ṽ'=>'v','Ṿ'=>'V','ṿ'=>'v','Ẁ'=>'W','ẁ'=>'w','Ẃ'=>'W','ẃ'=>'w','Ẅ'=>'W','ẅ'=>'w','Ẇ'=>'W','ẇ'=>'w','Ẉ'=>'W','ẉ'=>'w','Ẋ'=>'X','ẋ'=>'x','Ẍ'=>'X','ẍ'=>'x','Ẏ'=>'Y','ẏ'=>'y','Ẑ'=>'Z','ẑ'=>'z','Ẓ'=>'Z','ẓ'=>'z','Ẕ'=>'Z','ẕ'=>'z','ẖ'=>'h','ẗ'=>'t','ẘ'=>'w','ẙ'=>'y','ẛ'=>'ſ','Ạ'=>'A','ạ'=>'a','Ả'=>'A','ả'=>'a','Ấ'=>'A','ấ'=>'a','Ầ'=>'A','ầ'=>'a','Ẩ'=>'A','ẩ'=>'a','Ẫ'=>'A','ẫ'=>'a','Ậ'=>'A','ậ'=>'a','Ắ'=>'A','ắ'=>'a','Ằ'=>'A','ằ'=>'a','Ẳ'=>'A','ẳ'=>'a','Ẵ'=>'A','ẵ'=>'a','Ặ'=>'A','ặ'=>'a','Ẹ'=>'E','ẹ'=>'e','Ẻ'=>'E','ẻ'=>'e','Ẽ'=>'E','ẽ'=>'e','Ế'=>'E','ế'=>'e','Ề'=>'E','ề'=>'e','Ể'=>'E','ể'=>'e','Ễ'=>'E','ễ'=>'e','Ệ'=>'E','ệ'=>'e','Ỉ'=>'I','ỉ'=>'i','Ị'=>'I','ị'=>'i','Ọ'=>'O','ọ'=>'o','Ỏ'=>'O','ỏ'=>'o','Ố'=>'O','ố'=>'o','Ồ'=>'O','ồ'=>'o','Ổ'=>'O','ổ'=>'o','Ỗ'=>'O','ỗ'=>'o','Ộ'=>'O','ộ'=>'o','Ớ'=>'O','ớ'=>'o','Ờ'=>'O','ờ'=>'o','Ở'=>'O','ở'=>'o','Ỡ'=>'O','ỡ'=>'o','Ợ'=>'O','ợ'=>'o','Ụ'=>'U','ụ'=>'u','Ủ'=>'U','ủ'=>'u','Ứ'=>'U','ứ'=>'u','Ừ'=>'U','ừ'=>'u','Ử'=>'U','ử'=>'u','Ữ'=>'U','ữ'=>'u','Ự'=>'U','ự'=>'u','Ỳ'=>'Y','ỳ'=>'y','Ỵ'=>'Y','ỵ'=>'y','Ỷ'=>'Y','ỷ'=>'y','Ỹ'=>'Y','ỹ'=>'y','ἀ'=>'α','ἁ'=>'α','ἂ'=>'α','ἃ'=>'α','ἄ'=>'α','ἅ'=>'α','ἆ'=>'α','ἇ'=>'α','Ἀ'=>'Α','Ἁ'=>'Α','Ἂ'=>'Α','Ἃ'=>'Α','Ἄ'=>'Α','Ἅ'=>'Α','Ἆ'=>'Α','Ἇ'=>'Α','ἐ'=>'ε','ἑ'=>'ε','ἒ'=>'ε','ἓ'=>'ε','ἔ'=>'ε','ἕ'=>'ε','Ἐ'=>'Ε','Ἑ'=>'Ε','Ἒ'=>'Ε','Ἓ'=>'Ε','Ἔ'=>'Ε','Ἕ'=>'Ε','ἠ'=>'η','ἡ'=>'η','ἢ'=>'η','ἣ'=>'η','ἤ'=>'η','ἥ'=>'η','ἦ'=>'η','ἧ'=>'η','Ἠ'=>'Η','Ἡ'=>'Η','Ἢ'=>'Η','Ἣ'=>'Η','Ἤ'=>'Η','Ἥ'=>'Η','Ἦ'=>'Η','Ἧ'=>'Η','ἰ'=>'ι','ἱ'=>'ι','ἲ'=>'ι','ἳ'=>'ι','ἴ'=>'ι','ἵ'=>'ι','ἶ'=>'ι','ἷ'=>'ι','Ἰ'=>'Ι','Ἱ'=>'Ι','Ἲ'=>'Ι','Ἳ'=>'Ι','Ἴ'=>'Ι','Ἵ'=>'Ι','Ἶ'=>'Ι','Ἷ'=>'Ι','ὀ'=>'ο','ὁ'=>'ο','ὂ'=>'ο','ὃ'=>'ο','ὄ'=>'ο','ὅ'=>'ο','Ὀ'=>'Ο','Ὁ'=>'Ο','Ὂ'=>'Ο','Ὃ'=>'Ο','Ὄ'=>'Ο','Ὅ'=>'Ο','ὐ'=>'υ','ὑ'=>'υ','ὒ'=>'υ','ὓ'=>'υ','ὔ'=>'υ','ὕ'=>'υ','ὖ'=>'υ','ὗ'=>'υ','Ὑ'=>'Υ','Ὓ'=>'Υ','Ὕ'=>'Υ','Ὗ'=>'Υ','ὠ'=>'ω','ὡ'=>'ω','ὢ'=>'ω','ὣ'=>'ω','ὤ'=>'ω','ὥ'=>'ω','ὦ'=>'ω','ὧ'=>'ω','Ὠ'=>'Ω','Ὡ'=>'Ω','Ὢ'=>'Ω','Ὣ'=>'Ω','Ὤ'=>'Ω','Ὥ'=>'Ω','Ὦ'=>'Ω','Ὧ'=>'Ω','ὰ'=>'α','ά'=>'α','ὲ'=>'ε','έ'=>'ε','ὴ'=>'η','ή'=>'η','ὶ'=>'ι','ί'=>'ι','ὸ'=>'ο','ό'=>'ο','ὺ'=>'υ','ύ'=>'υ','ὼ'=>'ω','ώ'=>'ω','ᾀ'=>'α','ᾁ'=>'α','ᾂ'=>'α','ᾃ'=>'α','ᾄ'=>'α','ᾅ'=>'α','ᾆ'=>'α','ᾇ'=>'α','ᾈ'=>'Α','ᾉ'=>'Α','ᾊ'=>'Α','ᾋ'=>'Α','ᾌ'=>'Α','ᾍ'=>'Α','ᾎ'=>'Α','ᾏ'=>'Α','ᾐ'=>'η','ᾑ'=>'η','ᾒ'=>'η','ᾓ'=>'η','ᾔ'=>'η','ᾕ'=>'η','ᾖ'=>'η','ᾗ'=>'η','ᾘ'=>'Η','ᾙ'=>'Η','ᾚ'=>'Η','ᾛ'=>'Η','ᾜ'=>'Η','ᾝ'=>'Η','ᾞ'=>'Η','ᾟ'=>'Η','ᾠ'=>'ω','ᾡ'=>'ω','ᾢ'=>'ω','ᾣ'=>'ω','ᾤ'=>'ω','ᾥ'=>'ω','ᾦ'=>'ω','ᾧ'=>'ω','ᾨ'=>'Ω','ᾩ'=>'Ω','ᾪ'=>'Ω','ᾫ'=>'Ω','ᾬ'=>'Ω','ᾭ'=>'Ω','ᾮ'=>'Ω','ᾯ'=>'Ω','ᾰ'=>'α','ᾱ'=>'α','ᾲ'=>'α','ᾳ'=>'α','ᾴ'=>'α','ᾶ'=>'α','ᾷ'=>'α','Ᾰ'=>'Α','Ᾱ'=>'Α','Ὰ'=>'Α','Ά'=>'Α','ᾼ'=>'Α','ι'=>'ι','ῂ'=>'η','ῃ'=>'η','ῄ'=>'η','ῆ'=>'η','ῇ'=>'η','Ὲ'=>'Ε','Έ'=>'Ε','Ὴ'=>'Η','Ή'=>'Η','ῌ'=>'Η','ῐ'=>'ι','ῑ'=>'ι','ῒ'=>'ι','ΐ'=>'ι','ῖ'=>'ι','ῗ'=>'ι','Ῐ'=>'Ι','Ῑ'=>'Ι','Ὶ'=>'Ι','Ί'=>'Ι','ῠ'=>'υ','ῡ'=>'υ','ῢ'=>'υ','ΰ'=>'υ','ῤ'=>'ρ','ῥ'=>'ρ','ῦ'=>'υ','ῧ'=>'υ','Ῠ'=>'Υ','Ῡ'=>'Υ','Ὺ'=>'Υ','Ύ'=>'Υ','Ῥ'=>'Ρ','ῲ'=>'ω','ῳ'=>'ω','ῴ'=>'ω','ῶ'=>'ω','ῷ'=>'ω','Ὸ'=>'Ο','Ό'=>'Ο','Ὼ'=>'Ω','Ώ'=>'Ω','ῼ'=>'Ω'];
Upvotes: 4
Reputation: 499
Old topics but add what I found working very weel for me. You can customise the transliterator for your needs.
$transliterator = Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Upper(); :: NFC;', Transliterator::FORWARD);
return $transliterator->transliterate('çÇæ λώπηξ-- É&é-è_çà=@46/,*')
Output: "CCAE LOPEX-- E&E-E_CA=@46/,*";
Doc: https://www.php.net/manual/en/class.transliterator.php
Upvotes: 0
Reputation: 386
Since Symfony 5+
composer require symfony/string
use Symfony\Component\String\Slugger\AsciiSlugger;
$slugger = new AsciiSlugger();
$slug = $slugger->slug('Wôrķšƥáçè ~~sèťtïñğš~~');
// $slug = 'Workspace-settings'
More: https://symfony.com/doc/current/components/string.html#slugger
Upvotes: 1
Reputation: 41
function Unaccent($string){
return preg_replace('~&([a-z]{1,2}) (acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_QUOTES, 'UTF-8'));
}
Upvotes: 2
Reputation: 3049
Combining some of the answers:
/**
* Given an utf8 string, returns an ascii string.
* Only supports ISO-8859-1 chars, not chars like č, œ, Œ, ř, Š, š, ů, Ÿ or ž
* @param string $utf8
* @return string ascii
*/
function stripAccents(string $utf8): string
{
$src = 'àáâãäåçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ';
$dst = 'aaaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY';
$iso8859_1 = strtr(
utf8_decode($utf8),
utf8_decode($src),
$dst
);
// iconv below not used for UTF-8 directly because it adds chars like ^'"`
// which are nice to not lose info, but not very readable and not compatible with filenames
// iconv() additionally converts chars like
// ¡ ¢ £ ¥ ¦ § ¨ © ª « ¬ SHY ® ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ Å Æ Ð Ø Þ ß æ ÷ ø þ
// ! c lb yen | SS " (c) a << not SHY (R) ^0 +/- ^2 ^3 ' u P . , ^1 o >> 1/4 1/2 3/4 ? A AE D O Th ss ae d : o th
return iconv('ISO-8859-1', 'ASCII//TRANSLIT//IGNORE', $iso8859_1);
}
Upvotes: 1
Reputation: 48091
What about the WordPress implementation?
function remove_accents($string) {
if ( !preg_match('/[\x80-\xff]/', $string) )
return $string;
$chars = array(
// Decompositions for Latin-1 Supplement
chr(195).chr(128) => 'A', chr(195).chr(129) => 'A',
chr(195).chr(130) => 'A', chr(195).chr(131) => 'A',
chr(195).chr(132) => 'A', chr(195).chr(133) => 'A',
chr(195).chr(135) => 'C', chr(195).chr(136) => 'E',
chr(195).chr(137) => 'E', chr(195).chr(138) => 'E',
chr(195).chr(139) => 'E', chr(195).chr(140) => 'I',
chr(195).chr(141) => 'I', chr(195).chr(142) => 'I',
chr(195).chr(143) => 'I', chr(195).chr(145) => 'N',
chr(195).chr(146) => 'O', chr(195).chr(147) => 'O',
chr(195).chr(148) => 'O', chr(195).chr(149) => 'O',
chr(195).chr(150) => 'O', chr(195).chr(153) => 'U',
chr(195).chr(154) => 'U', chr(195).chr(155) => 'U',
chr(195).chr(156) => 'U', chr(195).chr(157) => 'Y',
chr(195).chr(159) => 's', chr(195).chr(160) => 'a',
chr(195).chr(161) => 'a', chr(195).chr(162) => 'a',
chr(195).chr(163) => 'a', chr(195).chr(164) => 'a',
chr(195).chr(165) => 'a', chr(195).chr(167) => 'c',
chr(195).chr(168) => 'e', chr(195).chr(169) => 'e',
chr(195).chr(170) => 'e', chr(195).chr(171) => 'e',
chr(195).chr(172) => 'i', chr(195).chr(173) => 'i',
chr(195).chr(174) => 'i', chr(195).chr(175) => 'i',
chr(195).chr(177) => 'n', chr(195).chr(178) => 'o',
chr(195).chr(179) => 'o', chr(195).chr(180) => 'o',
chr(195).chr(181) => 'o', chr(195).chr(182) => 'o',
chr(195).chr(182) => 'o', chr(195).chr(185) => 'u',
chr(195).chr(186) => 'u', chr(195).chr(187) => 'u',
chr(195).chr(188) => 'u', chr(195).chr(189) => 'y',
chr(195).chr(191) => 'y',
// Decompositions for Latin Extended-A
chr(196).chr(128) => 'A', chr(196).chr(129) => 'a',
chr(196).chr(130) => 'A', chr(196).chr(131) => 'a',
chr(196).chr(132) => 'A', chr(196).chr(133) => 'a',
chr(196).chr(134) => 'C', chr(196).chr(135) => 'c',
chr(196).chr(136) => 'C', chr(196).chr(137) => 'c',
chr(196).chr(138) => 'C', chr(196).chr(139) => 'c',
chr(196).chr(140) => 'C', chr(196).chr(141) => 'c',
chr(196).chr(142) => 'D', chr(196).chr(143) => 'd',
chr(196).chr(144) => 'D', chr(196).chr(145) => 'd',
chr(196).chr(146) => 'E', chr(196).chr(147) => 'e',
chr(196).chr(148) => 'E', chr(196).chr(149) => 'e',
chr(196).chr(150) => 'E', chr(196).chr(151) => 'e',
chr(196).chr(152) => 'E', chr(196).chr(153) => 'e',
chr(196).chr(154) => 'E', chr(196).chr(155) => 'e',
chr(196).chr(156) => 'G', chr(196).chr(157) => 'g',
chr(196).chr(158) => 'G', chr(196).chr(159) => 'g',
chr(196).chr(160) => 'G', chr(196).chr(161) => 'g',
chr(196).chr(162) => 'G', chr(196).chr(163) => 'g',
chr(196).chr(164) => 'H', chr(196).chr(165) => 'h',
chr(196).chr(166) => 'H', chr(196).chr(167) => 'h',
chr(196).chr(168) => 'I', chr(196).chr(169) => 'i',
chr(196).chr(170) => 'I', chr(196).chr(171) => 'i',
chr(196).chr(172) => 'I', chr(196).chr(173) => 'i',
chr(196).chr(174) => 'I', chr(196).chr(175) => 'i',
chr(196).chr(176) => 'I', chr(196).chr(177) => 'i',
chr(196).chr(178) => 'IJ',chr(196).chr(179) => 'ij',
chr(196).chr(180) => 'J', chr(196).chr(181) => 'j',
chr(196).chr(182) => 'K', chr(196).chr(183) => 'k',
chr(196).chr(184) => 'k', chr(196).chr(185) => 'L',
chr(196).chr(186) => 'l', chr(196).chr(187) => 'L',
chr(196).chr(188) => 'l', chr(196).chr(189) => 'L',
chr(196).chr(190) => 'l', chr(196).chr(191) => 'L',
chr(197).chr(128) => 'l', chr(197).chr(129) => 'L',
chr(197).chr(130) => 'l', chr(197).chr(131) => 'N',
chr(197).chr(132) => 'n', chr(197).chr(133) => 'N',
chr(197).chr(134) => 'n', chr(197).chr(135) => 'N',
chr(197).chr(136) => 'n', chr(197).chr(137) => 'N',
chr(197).chr(138) => 'n', chr(197).chr(139) => 'N',
chr(197).chr(140) => 'O', chr(197).chr(141) => 'o',
chr(197).chr(142) => 'O', chr(197).chr(143) => 'o',
chr(197).chr(144) => 'O', chr(197).chr(145) => 'o',
chr(197).chr(146) => 'OE',chr(197).chr(147) => 'oe',
chr(197).chr(148) => 'R',chr(197).chr(149) => 'r',
chr(197).chr(150) => 'R',chr(197).chr(151) => 'r',
chr(197).chr(152) => 'R',chr(197).chr(153) => 'r',
chr(197).chr(154) => 'S',chr(197).chr(155) => 's',
chr(197).chr(156) => 'S',chr(197).chr(157) => 's',
chr(197).chr(158) => 'S',chr(197).chr(159) => 's',
chr(197).chr(160) => 'S', chr(197).chr(161) => 's',
chr(197).chr(162) => 'T', chr(197).chr(163) => 't',
chr(197).chr(164) => 'T', chr(197).chr(165) => 't',
chr(197).chr(166) => 'T', chr(197).chr(167) => 't',
chr(197).chr(168) => 'U', chr(197).chr(169) => 'u',
chr(197).chr(170) => 'U', chr(197).chr(171) => 'u',
chr(197).chr(172) => 'U', chr(197).chr(173) => 'u',
chr(197).chr(174) => 'U', chr(197).chr(175) => 'u',
chr(197).chr(176) => 'U', chr(197).chr(177) => 'u',
chr(197).chr(178) => 'U', chr(197).chr(179) => 'u',
chr(197).chr(180) => 'W', chr(197).chr(181) => 'w',
chr(197).chr(182) => 'Y', chr(197).chr(183) => 'y',
chr(197).chr(184) => 'Y', chr(197).chr(185) => 'Z',
chr(197).chr(186) => 'z', chr(197).chr(187) => 'Z',
chr(197).chr(188) => 'z', chr(197).chr(189) => 'Z',
chr(197).chr(190) => 'z', chr(197).chr(191) => 's'
);
$string = strtr($string, $chars);
return $string;
}
To understand what this function does, check the conversion table:
À => A
Á => A
 => A
à => A
Ä => A
Å => A
Ç => C
È => E
É => E
Ê => E
Ë => E
Ì => I
Í => I
Î => I
Ï => I
Ñ => N
Ò => O
Ó => O
Ô => O
Õ => O
Ö => O
Ù => U
Ú => U
Û => U
Ü => U
Ý => Y
ß => s
à => a
á => a
â => a
ã => a
ä => a
å => a
ç => c
è => e
é => e
ê => e
ë => e
ì => i
í => i
î => i
ï => i
ñ => n
ò => o
ó => o
ô => o
õ => o
ö => o
ù => u
ú => u
û => u
ü => u
ý => y
ÿ => y
Ā => A
ā => a
Ă => A
ă => a
Ą => A
ą => a
Ć => C
ć => c
Ĉ => C
ĉ => c
Ċ => C
ċ => c
Č => C
č => c
Ď => D
ď => d
Đ => D
đ => d
Ē => E
ē => e
Ĕ => E
ĕ => e
Ė => E
ė => e
Ę => E
ę => e
Ě => E
ě => e
Ĝ => G
ĝ => g
Ğ => G
ğ => g
Ġ => G
ġ => g
Ģ => G
ģ => g
Ĥ => H
ĥ => h
Ħ => H
ħ => h
Ĩ => I
ĩ => i
Ī => I
ī => i
Ĭ => I
ĭ => i
Į => I
į => i
İ => I
ı => i
IJ => IJ
ij => ij
Ĵ => J
ĵ => j
Ķ => K
ķ => k
ĸ => k
Ĺ => L
ĺ => l
Ļ => L
ļ => l
Ľ => L
ľ => l
Ŀ => L
ŀ => l
Ł => L
ł => l
Ń => N
ń => n
Ņ => N
ņ => n
Ň => N
ň => n
ʼn => N
Ŋ => n
ŋ => N
Ō => O
ō => o
Ŏ => O
ŏ => o
Ő => O
ő => o
Œ => OE
œ => oe
Ŕ => R
ŕ => r
Ŗ => R
ŗ => r
Ř => R
ř => r
Ś => S
ś => s
Ŝ => S
ŝ => s
Ş => S
ş => s
Š => S
š => s
Ţ => T
ţ => t
Ť => T
ť => t
Ŧ => T
ŧ => t
Ũ => U
ũ => u
Ū => U
ū => u
Ŭ => U
ŭ => u
Ů => U
ů => u
Ű => U
ű => u
Ų => U
ų => u
Ŵ => W
ŵ => w
Ŷ => Y
ŷ => y
Ÿ => Y
Ź => Z
ź => z
Ż => Z
ż => z
Ž => Z
ž => z
ſ => s
You can generate the conversion table yourself by simply iterating over the $chars
array of the function:
foreach($chars as $k=>$v) {
printf("%s -> %s", $k, $v);
}
Upvotes: 130
Reputation: 31
This answer I've got following tips here, so it is not really mine. It works for me using LATIN1 or UTF-8. If you use other charsets, you probably should add them to mb_detect_encoding
function. Correct environment set is probably needed also.
function NoAccents($s){
return iconv(mb_detect_encoding($s,'UTF-8, ASCII, ISO-8859-1'),'ASCII//TRANSLIT//INGORE',$s);
}
Upvotes: 1
Reputation: 913
What's wrong with this one? Works with UTF8
function strip_accents($s){
return str_replace(
explode(' ', preg_replace('/ +/', ' ', 'č ć ž š đ Č Ć Ž Š Đ à á â ã ä ç è é ê ë ì í î ï ñ ò ó ô õ ö ù ú û ü ý ÿ À Á Â Ã Ä Ç È É Ê Ë Ì Í Î Ï Ñ Ò Ó Ô Õ Ö Ù Ú Û Ü Ý')),
explode(' ', preg_replace('/ +/', ' ', 'c c z s dj C C Z S DJ a a a a a c e e e e i i i i n o o o o o u u u u y y A A A A A C E E E E I I I I N O O O O O U U U U Y')),
$s);
}
It can be faster by not using preg_replace
, but speed was not my goal here.
Upvotes: 4
Reputation: 539
If the main task is just to use the string in a URL, why not to use slugyfier?
composer require cocur/slugify
then
use Cocur\Slugify\Slugify;
$slugify = new Slugify();
echo $slugify->slugify('Fóø Bår');
It also has many bridges for popular frameworks. E.g. you can use Doctrine Extensions Sluggable behaviour to generate automatically unique slug for each entity in DB and use it in URL.
If you want just to wipe out all accents you can play around with rulesets to satisfy the requirements.
Upvotes: 2
Reputation: 101
Something like this?
$arrSearch = explode(","," ,ç,æ, œ, á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,e,i,ø,u");
$arrReplace = explode(",","_,c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,e,i,o,u");
$output = str_replace($arrSearch, $arrReplace, $input);
Upvotes: 2
Reputation: 257
In laravel you can simply use str_slug($accentedPhrase)
and if you care about dash (-) that this method substitute with space you can use str_replace('-', ' ', str_slug($accentedPhrase))
Upvotes: 2
Reputation: 864
Based on @Mimouni answer I made this function to transliterate Accented strings to Non Accented strings.
/**
* @param $str Convert string to lowercase and replace special chars to equivalents ou remove its
* @return string
*/
function _slugify(string $string): string
{
$str = $string; // for comparisons
$str = _toUtf8($str); // Force to work with string in UTF-8
$str = iconv('UTF-8', 'ASCII//TRANSLIT', $str);
if ($str != htmlentities($string, ENT_QUOTES, 'UTF-8')) { // iconv fails
$str = _toUtf8($string);
$str = htmlentities($str, ENT_QUOTES, 'UTF-8');
$str = preg_replace('#&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);#i', '$1', $str);
// Need to strip non ASCII chars or any other than a-z, A-Z, 0-9...
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
$str = preg_replace(array('#[^0-9a-z]#i', '#[ -]+#'), ' ', $str);
$str = trim($str, ' -');
}
// lowercase
$string = strtolower($str);
return $string;
}
To convert strings to UTF-8, here I use the Multi Byte String extension. Note that I break string in pieces to avoid trouble with mixed content (I have such situation) and convert word by word.
/**
* @param $str string String in any encoding
* @return string
*/
function _toUtf8(string $str_in): ?string
{
if (!function_exists('mb_detect_encoding')) {
throw new \Exception('The Multi Byte String extension is absent!');
}
$str_out = [];
$words = explode(" ", $str_in);
foreach ($words as $word) {
$current_encoding = mb_detect_encoding($word, 'UTF-8, ASCII, ISO-8859-1');
$str_out[] = mb_convert_encoding($word, 'UTF-8', $current_encoding);
}
return implode(" ", $str_out);
}
Footer Notes: Was the only solution that pass in PHPUnit UnitTests in Windows command Line (locale issues) The @gabo solution should work but unfortunately not for me
Upvotes: 0
Reputation: 2950
The easiest way is to use iconv()
PHP native function.
echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', "Thîs îs à vêry wrong séntènce!");
// output: This is a very wrong sentence!
Upvotes: 26
Reputation: 1686
if you have http://php.net/manual/en/book.intl.php available, this solved your problem
$string = "Fóø Bår";
$transliterator = Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);
echo $normalized = $transliterator->transliterate($string);
Upvotes: 55
Reputation: 10017
I just created a removeAccents method based on the reading of this thread and this other one too (How to remove accents and turn letters into "plain" ASCII characters?).
The method is here: https://github.com/lingtalfi/Bat/blob/master/StringTool.md#removeaccents
Tests are here: https://github.com/lingtalfi/Bat/blob/master/btests/StringTool/removeAccents/stringTool.removeAccents.test.php,
and here is what was tested so far:
$a = [
// easy
'',
'a',
'après',
'dédé fait la fête ?',
// hard
'àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ',
'ŻŹĆŃĄŚŁĘÓżźćńąśłęó',
'qqqqŻŹĆŃĄŚŁĘÓżźćńąśłęóqqq',
'ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïðñòóôõöøùúûüýÿ',
'ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ',
'ĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİĴĵĶķ',
'ĹĺĻļĽľĿŀŁłŃńŅņŇňʼnŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž',
'ſƒƠơƯưǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǺǻǾǿ',
'Ǽǽ',
];
and it converts only accentuated things (letters/ligatures/cédilles/some letters with a line through/...?).
Here is the content of the method: (https://github.com/lingtalfi/Bat/blob/master/StringTool.php#L83)
public static function removeAccents($str)
{
static $map = [
// single letters
'à' => 'a',
'á' => 'a',
'â' => 'a',
'ã' => 'a',
'ä' => 'a',
'ą' => 'a',
'å' => 'a',
'ā' => 'a',
'ă' => 'a',
'ǎ' => 'a',
'ǻ' => 'a',
'À' => 'A',
'Á' => 'A',
'Â' => 'A',
'Ã' => 'A',
'Ä' => 'A',
'Ą' => 'A',
'Å' => 'A',
'Ā' => 'A',
'Ă' => 'A',
'Ǎ' => 'A',
'Ǻ' => 'A',
'ç' => 'c',
'ć' => 'c',
'ĉ' => 'c',
'ċ' => 'c',
'č' => 'c',
'Ç' => 'C',
'Ć' => 'C',
'Ĉ' => 'C',
'Ċ' => 'C',
'Č' => 'C',
'ď' => 'd',
'đ' => 'd',
'Ð' => 'D',
'Ď' => 'D',
'Đ' => 'D',
'è' => 'e',
'é' => 'e',
'ê' => 'e',
'ë' => 'e',
'ę' => 'e',
'ē' => 'e',
'ĕ' => 'e',
'ė' => 'e',
'ě' => 'e',
'È' => 'E',
'É' => 'E',
'Ê' => 'E',
'Ë' => 'E',
'Ę' => 'E',
'Ē' => 'E',
'Ĕ' => 'E',
'Ė' => 'E',
'Ě' => 'E',
'ƒ' => 'f',
'ĝ' => 'g',
'ğ' => 'g',
'ġ' => 'g',
'ģ' => 'g',
'Ĝ' => 'G',
'Ğ' => 'G',
'Ġ' => 'G',
'Ģ' => 'G',
'ĥ' => 'h',
'ħ' => 'h',
'Ĥ' => 'H',
'Ħ' => 'H',
'ì' => 'i',
'í' => 'i',
'î' => 'i',
'ï' => 'i',
'ĩ' => 'i',
'ī' => 'i',
'ĭ' => 'i',
'į' => 'i',
'ſ' => 'i',
'ǐ' => 'i',
'Ì' => 'I',
'Í' => 'I',
'Î' => 'I',
'Ï' => 'I',
'Ĩ' => 'I',
'Ī' => 'I',
'Ĭ' => 'I',
'Į' => 'I',
'İ' => 'I',
'Ǐ' => 'I',
'ĵ' => 'j',
'Ĵ' => 'J',
'ķ' => 'k',
'Ķ' => 'K',
'ł' => 'l',
'ĺ' => 'l',
'ļ' => 'l',
'ľ' => 'l',
'ŀ' => 'l',
'Ł' => 'L',
'Ĺ' => 'L',
'Ļ' => 'L',
'Ľ' => 'L',
'Ŀ' => 'L',
'ñ' => 'n',
'ń' => 'n',
'ņ' => 'n',
'ň' => 'n',
'ʼn' => 'n',
'Ñ' => 'N',
'Ń' => 'N',
'Ņ' => 'N',
'Ň' => 'N',
'ò' => 'o',
'ó' => 'o',
'ô' => 'o',
'õ' => 'o',
'ö' => 'o',
'ð' => 'o',
'ø' => 'o',
'ō' => 'o',
'ŏ' => 'o',
'ő' => 'o',
'ơ' => 'o',
'ǒ' => 'o',
'ǿ' => 'o',
'Ò' => 'O',
'Ó' => 'O',
'Ô' => 'O',
'Õ' => 'O',
'Ö' => 'O',
'Ø' => 'O',
'Ō' => 'O',
'Ŏ' => 'O',
'Ő' => 'O',
'Ơ' => 'O',
'Ǒ' => 'O',
'Ǿ' => 'O',
'ŕ' => 'r',
'ŗ' => 'r',
'ř' => 'r',
'Ŕ' => 'R',
'Ŗ' => 'R',
'Ř' => 'R',
'ś' => 's',
'š' => 's',
'ŝ' => 's',
'ş' => 's',
'Ś' => 'S',
'Š' => 'S',
'Ŝ' => 'S',
'Ş' => 'S',
'ţ' => 't',
'ť' => 't',
'ŧ' => 't',
'Ţ' => 'T',
'Ť' => 'T',
'Ŧ' => 'T',
'ù' => 'u',
'ú' => 'u',
'û' => 'u',
'ü' => 'u',
'ũ' => 'u',
'ū' => 'u',
'ŭ' => 'u',
'ů' => 'u',
'ű' => 'u',
'ų' => 'u',
'ư' => 'u',
'ǔ' => 'u',
'ǖ' => 'u',
'ǘ' => 'u',
'ǚ' => 'u',
'ǜ' => 'u',
'Ù' => 'U',
'Ú' => 'U',
'Û' => 'U',
'Ü' => 'U',
'Ũ' => 'U',
'Ū' => 'U',
'Ŭ' => 'U',
'Ů' => 'U',
'Ű' => 'U',
'Ų' => 'U',
'Ư' => 'U',
'Ǔ' => 'U',
'Ǖ' => 'U',
'Ǘ' => 'U',
'Ǚ' => 'U',
'Ǜ' => 'U',
'ŵ' => 'w',
'Ŵ' => 'W',
'ý' => 'y',
'ÿ' => 'y',
'ŷ' => 'y',
'Ý' => 'Y',
'Ÿ' => 'Y',
'Ŷ' => 'Y',
'ż' => 'z',
'ź' => 'z',
'ž' => 'z',
'Ż' => 'Z',
'Ź' => 'Z',
'Ž' => 'Z',
// accentuated ligatures
'Ǽ' => 'A',
'ǽ' => 'a',
];
return strtr($str, $map);
}
Upvotes: 4
Reputation: 31
<?php
/*
* Thanks:
* - The idea of extracting accents equiv chars with the help of the HTMLSpecialChars convertion was taking from ICanBoogie Package of 'Olivier Laviale' {@link http://www.weirdog.com/blog/php/supprimer-les-accents-des-caracteres-accentues.html}
*/
function accentCharsModifier($str){
if(($length=mb_strlen($str,"UTF-8"))<strlen($str)){
$i=$count=0;
while($i<$length){
if(strlen($c=mb_substr($str,$i,1,"UTF-8"))>1){
$he=htmlentities($c);
if(($nC=preg_replace("#&([A-Za-z])(?:acute|cedil|caron|circ|grave|orn|ring|slash|th|tilde|uml);#", "\\1", $he))!=$he ||
($nC=preg_replace("#&([A-Za-z]{2})(?:lig);#", "\\1", $he))!=$he ||
($nC=preg_replace("#&[^;]+;#", "", $he))!=$he){
$str=str_replace($c,$nC,$str,$count);if($nC==""){$length=$length-$count;$i--;}
}
}
$i++;
}
}
return $str;
}
echo accentCharsModifier("&éôpkAÈû");
?>
Upvotes: 0
Reputation: 1330
When using iconv
, the parameter locale must be set:
function test_enc($text = 'ěščřžýáíé ĚŠČŘŽÝÁÍÉ fóø bår FÓØ BÅR æ')
{
echo '<tt>';
echo iconv('utf8', 'ascii//TRANSLIT', $text);
echo '</tt><br/>';
}
test_enc();
setlocale(LC_ALL, 'cs_CZ.utf8');
test_enc();
setlocale(LC_ALL, 'en_US.utf8');
test_enc();
Yields into:
????????? ????????? f?? b?r F?? B?R ae
escrzyaie ESCRZYAIE fo? bar FO? BAR ae
escrzyaie ESCRZYAIE fo? bar FO? BAR ae
Another locales then cs_CZ and en_US I haven't installed and I can't test it.
In C# I see solution using translation to unicode normalized form - accents are splitted out and then filtered via nonspacing unicode category.
Upvotes: 19
Reputation: 3640
here is a simple function that i use usually to remove accents :
function str_without_accents($str, $charset='utf-8')
{
$str = htmlentities($str, ENT_NOQUOTES, $charset);
$str = preg_replace('#&([A-za-z])(?:acute|cedil|caron|circ|grave|orn|ring|slash|th|tilde|uml);#', '\1', $str);
$str = preg_replace('#&([A-za-z]{2})(?:lig);#', '\1', $str); // pour les ligatures e.g. 'œ'
$str = preg_replace('#&[^;]+;#', '', $str); // supprime les autres caractères
return $str; // or add this : mb_strtoupper($str); for uppercase :)
}
Upvotes: 8
Reputation: 179
An improved version of remove_accents()
function according to last version Wordpress 4.3 formatting is:
function mbstring_binary_safe_encoding( $reset = false ) {
static $encodings = array();
static $overloaded = null;
if ( is_null( $overloaded ) )
$overloaded = function_exists( 'mb_internal_encoding' ) && ( ini_get( 'mbstring.func_overload' ) & 2 );
if ( false === $overloaded )
return;
if ( ! $reset ) {
$encoding = mb_internal_encoding();
array_push( $encodings, $encoding );
mb_internal_encoding( 'ISO-8859-1' );
}
if ( $reset && $encodings ) {
$encoding = array_pop( $encodings );
mb_internal_encoding( $encoding );
}
}
function reset_mbstring_encoding() {
mbstring_binary_safe_encoding( true );
}
function seems_utf8( $str ) {
mbstring_binary_safe_encoding();
$length = strlen($str);
reset_mbstring_encoding();
for ($i=0; $i < $length; $i++) {
$c = ord($str[$i]);
if ($c < 0x80) $n = 0; // 0bbbbbbb
elseif (($c & 0xE0) == 0xC0) $n=1; // 110bbbbb
elseif (($c & 0xF0) == 0xE0) $n=2; // 1110bbbb
elseif (($c & 0xF8) == 0xF0) $n=3; // 11110bbb
elseif (($c & 0xFC) == 0xF8) $n=4; // 111110bb
elseif (($c & 0xFE) == 0xFC) $n=5; // 1111110b
else return false; // Does not match any model
for ($j=0; $j<$n; $j++) { // n bytes matching 10bbbbbb follow ?
if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
return false;
}
}
return true;
}
function remove_accents( $string ) {
if ( !preg_match('/[\x80-\xff]/', $string) )
return $string;
if (seems_utf8($string)) {
$chars = array(
// Decompositions for Latin-1 Supplement
chr(194).chr(170) => 'a', chr(194).chr(186) => 'o',
chr(195).chr(128) => 'A', chr(195).chr(129) => 'A',
chr(195).chr(130) => 'A', chr(195).chr(131) => 'A',
chr(195).chr(132) => 'A', chr(195).chr(133) => 'A',
chr(195).chr(134) => 'AE',chr(195).chr(135) => 'C',
chr(195).chr(136) => 'E', chr(195).chr(137) => 'E',
chr(195).chr(138) => 'E', chr(195).chr(139) => 'E',
chr(195).chr(140) => 'I', chr(195).chr(141) => 'I',
chr(195).chr(142) => 'I', chr(195).chr(143) => 'I',
chr(195).chr(144) => 'D', chr(195).chr(145) => 'N',
chr(195).chr(146) => 'O', chr(195).chr(147) => 'O',
chr(195).chr(148) => 'O', chr(195).chr(149) => 'O',
chr(195).chr(150) => 'O', chr(195).chr(153) => 'U',
chr(195).chr(154) => 'U', chr(195).chr(155) => 'U',
chr(195).chr(156) => 'U', chr(195).chr(157) => 'Y',
chr(195).chr(158) => 'TH',chr(195).chr(159) => 's',
chr(195).chr(160) => 'a', chr(195).chr(161) => 'a',
chr(195).chr(162) => 'a', chr(195).chr(163) => 'a',
chr(195).chr(164) => 'a', chr(195).chr(165) => 'a',
chr(195).chr(166) => 'ae',chr(195).chr(167) => 'c',
chr(195).chr(168) => 'e', chr(195).chr(169) => 'e',
chr(195).chr(170) => 'e', chr(195).chr(171) => 'e',
chr(195).chr(172) => 'i', chr(195).chr(173) => 'i',
chr(195).chr(174) => 'i', chr(195).chr(175) => 'i',
chr(195).chr(176) => 'd', chr(195).chr(177) => 'n',
chr(195).chr(178) => 'o', chr(195).chr(179) => 'o',
chr(195).chr(180) => 'o', chr(195).chr(181) => 'o',
chr(195).chr(182) => 'o', chr(195).chr(184) => 'o',
chr(195).chr(185) => 'u', chr(195).chr(186) => 'u',
chr(195).chr(187) => 'u', chr(195).chr(188) => 'u',
chr(195).chr(189) => 'y', chr(195).chr(190) => 'th',
chr(195).chr(191) => 'y', chr(195).chr(152) => 'O',
// Decompositions for Latin Extended-A
chr(196).chr(128) => 'A', chr(196).chr(129) => 'a',
chr(196).chr(130) => 'A', chr(196).chr(131) => 'a',
chr(196).chr(132) => 'A', chr(196).chr(133) => 'a',
chr(196).chr(134) => 'C', chr(196).chr(135) => 'c',
chr(196).chr(136) => 'C', chr(196).chr(137) => 'c',
chr(196).chr(138) => 'C', chr(196).chr(139) => 'c',
chr(196).chr(140) => 'C', chr(196).chr(141) => 'c',
chr(196).chr(142) => 'D', chr(196).chr(143) => 'd',
chr(196).chr(144) => 'D', chr(196).chr(145) => 'd',
chr(196).chr(146) => 'E', chr(196).chr(147) => 'e',
chr(196).chr(148) => 'E', chr(196).chr(149) => 'e',
chr(196).chr(150) => 'E', chr(196).chr(151) => 'e',
chr(196).chr(152) => 'E', chr(196).chr(153) => 'e',
chr(196).chr(154) => 'E', chr(196).chr(155) => 'e',
chr(196).chr(156) => 'G', chr(196).chr(157) => 'g',
chr(196).chr(158) => 'G', chr(196).chr(159) => 'g',
chr(196).chr(160) => 'G', chr(196).chr(161) => 'g',
chr(196).chr(162) => 'G', chr(196).chr(163) => 'g',
chr(196).chr(164) => 'H', chr(196).chr(165) => 'h',
chr(196).chr(166) => 'H', chr(196).chr(167) => 'h',
chr(196).chr(168) => 'I', chr(196).chr(169) => 'i',
chr(196).chr(170) => 'I', chr(196).chr(171) => 'i',
chr(196).chr(172) => 'I', chr(196).chr(173) => 'i',
chr(196).chr(174) => 'I', chr(196).chr(175) => 'i',
chr(196).chr(176) => 'I', chr(196).chr(177) => 'i',
chr(196).chr(178) => 'IJ',chr(196).chr(179) => 'ij',
chr(196).chr(180) => 'J', chr(196).chr(181) => 'j',
chr(196).chr(182) => 'K', chr(196).chr(183) => 'k',
chr(196).chr(184) => 'k', chr(196).chr(185) => 'L',
chr(196).chr(186) => 'l', chr(196).chr(187) => 'L',
chr(196).chr(188) => 'l', chr(196).chr(189) => 'L',
chr(196).chr(190) => 'l', chr(196).chr(191) => 'L',
chr(197).chr(128) => 'l', chr(197).chr(129) => 'L',
chr(197).chr(130) => 'l', chr(197).chr(131) => 'N',
chr(197).chr(132) => 'n', chr(197).chr(133) => 'N',
chr(197).chr(134) => 'n', chr(197).chr(135) => 'N',
chr(197).chr(136) => 'n', chr(197).chr(137) => 'N',
chr(197).chr(138) => 'n', chr(197).chr(139) => 'N',
chr(197).chr(140) => 'O', chr(197).chr(141) => 'o',
chr(197).chr(142) => 'O', chr(197).chr(143) => 'o',
chr(197).chr(144) => 'O', chr(197).chr(145) => 'o',
chr(197).chr(146) => 'OE',chr(197).chr(147) => 'oe',
chr(197).chr(148) => 'R',chr(197).chr(149) => 'r',
chr(197).chr(150) => 'R',chr(197).chr(151) => 'r',
chr(197).chr(152) => 'R',chr(197).chr(153) => 'r',
chr(197).chr(154) => 'S',chr(197).chr(155) => 's',
chr(197).chr(156) => 'S',chr(197).chr(157) => 's',
chr(197).chr(158) => 'S',chr(197).chr(159) => 's',
chr(197).chr(160) => 'S', chr(197).chr(161) => 's',
chr(197).chr(162) => 'T', chr(197).chr(163) => 't',
chr(197).chr(164) => 'T', chr(197).chr(165) => 't',
chr(197).chr(166) => 'T', chr(197).chr(167) => 't',
chr(197).chr(168) => 'U', chr(197).chr(169) => 'u',
chr(197).chr(170) => 'U', chr(197).chr(171) => 'u',
chr(197).chr(172) => 'U', chr(197).chr(173) => 'u',
chr(197).chr(174) => 'U', chr(197).chr(175) => 'u',
chr(197).chr(176) => 'U', chr(197).chr(177) => 'u',
chr(197).chr(178) => 'U', chr(197).chr(179) => 'u',
chr(197).chr(180) => 'W', chr(197).chr(181) => 'w',
chr(197).chr(182) => 'Y', chr(197).chr(183) => 'y',
chr(197).chr(184) => 'Y', chr(197).chr(185) => 'Z',
chr(197).chr(186) => 'z', chr(197).chr(187) => 'Z',
chr(197).chr(188) => 'z', chr(197).chr(189) => 'Z',
chr(197).chr(190) => 'z', chr(197).chr(191) => 's',
// Decompositions for Latin Extended-B
chr(200).chr(152) => 'S', chr(200).chr(153) => 's',
chr(200).chr(154) => 'T', chr(200).chr(155) => 't',
// Euro Sign
chr(226).chr(130).chr(172) => 'E',
// GBP (Pound) Sign
chr(194).chr(163) => '',
// Vowels with diacritic (Vietnamese)
// unmarked
chr(198).chr(160) => 'O', chr(198).chr(161) => 'o',
chr(198).chr(175) => 'U', chr(198).chr(176) => 'u',
// grave accent
chr(225).chr(186).chr(166) => 'A', chr(225).chr(186).chr(167) => 'a',
chr(225).chr(186).chr(176) => 'A', chr(225).chr(186).chr(177) => 'a',
chr(225).chr(187).chr(128) => 'E', chr(225).chr(187).chr(129) => 'e',
chr(225).chr(187).chr(146) => 'O', chr(225).chr(187).chr(147) => 'o',
chr(225).chr(187).chr(156) => 'O', chr(225).chr(187).chr(157) => 'o',
chr(225).chr(187).chr(170) => 'U', chr(225).chr(187).chr(171) => 'u',
chr(225).chr(187).chr(178) => 'Y', chr(225).chr(187).chr(179) => 'y',
// hook
chr(225).chr(186).chr(162) => 'A', chr(225).chr(186).chr(163) => 'a',
chr(225).chr(186).chr(168) => 'A', chr(225).chr(186).chr(169) => 'a',
chr(225).chr(186).chr(178) => 'A', chr(225).chr(186).chr(179) => 'a',
chr(225).chr(186).chr(186) => 'E', chr(225).chr(186).chr(187) => 'e',
chr(225).chr(187).chr(130) => 'E', chr(225).chr(187).chr(131) => 'e',
chr(225).chr(187).chr(136) => 'I', chr(225).chr(187).chr(137) => 'i',
chr(225).chr(187).chr(142) => 'O', chr(225).chr(187).chr(143) => 'o',
chr(225).chr(187).chr(148) => 'O', chr(225).chr(187).chr(149) => 'o',
chr(225).chr(187).chr(158) => 'O', chr(225).chr(187).chr(159) => 'o',
chr(225).chr(187).chr(166) => 'U', chr(225).chr(187).chr(167) => 'u',
chr(225).chr(187).chr(172) => 'U', chr(225).chr(187).chr(173) => 'u',
chr(225).chr(187).chr(182) => 'Y', chr(225).chr(187).chr(183) => 'y',
// tilde
chr(225).chr(186).chr(170) => 'A', chr(225).chr(186).chr(171) => 'a',
chr(225).chr(186).chr(180) => 'A', chr(225).chr(186).chr(181) => 'a',
chr(225).chr(186).chr(188) => 'E', chr(225).chr(186).chr(189) => 'e',
chr(225).chr(187).chr(132) => 'E', chr(225).chr(187).chr(133) => 'e',
chr(225).chr(187).chr(150) => 'O', chr(225).chr(187).chr(151) => 'o',
chr(225).chr(187).chr(160) => 'O', chr(225).chr(187).chr(161) => 'o',
chr(225).chr(187).chr(174) => 'U', chr(225).chr(187).chr(175) => 'u',
chr(225).chr(187).chr(184) => 'Y', chr(225).chr(187).chr(185) => 'y',
// acute accent
chr(225).chr(186).chr(164) => 'A', chr(225).chr(186).chr(165) => 'a',
chr(225).chr(186).chr(174) => 'A', chr(225).chr(186).chr(175) => 'a',
chr(225).chr(186).chr(190) => 'E', chr(225).chr(186).chr(191) => 'e',
chr(225).chr(187).chr(144) => 'O', chr(225).chr(187).chr(145) => 'o',
chr(225).chr(187).chr(154) => 'O', chr(225).chr(187).chr(155) => 'o',
chr(225).chr(187).chr(168) => 'U', chr(225).chr(187).chr(169) => 'u',
// dot below
chr(225).chr(186).chr(160) => 'A', chr(225).chr(186).chr(161) => 'a',
chr(225).chr(186).chr(172) => 'A', chr(225).chr(186).chr(173) => 'a',
chr(225).chr(186).chr(182) => 'A', chr(225).chr(186).chr(183) => 'a',
chr(225).chr(186).chr(184) => 'E', chr(225).chr(186).chr(185) => 'e',
chr(225).chr(187).chr(134) => 'E', chr(225).chr(187).chr(135) => 'e',
chr(225).chr(187).chr(138) => 'I', chr(225).chr(187).chr(139) => 'i',
chr(225).chr(187).chr(140) => 'O', chr(225).chr(187).chr(141) => 'o',
chr(225).chr(187).chr(152) => 'O', chr(225).chr(187).chr(153) => 'o',
chr(225).chr(187).chr(162) => 'O', chr(225).chr(187).chr(163) => 'o',
chr(225).chr(187).chr(164) => 'U', chr(225).chr(187).chr(165) => 'u',
chr(225).chr(187).chr(176) => 'U', chr(225).chr(187).chr(177) => 'u',
chr(225).chr(187).chr(180) => 'Y', chr(225).chr(187).chr(181) => 'y',
// Vowels with diacritic (Chinese, Hanyu Pinyin)
chr(201).chr(145) => 'a',
// macron
chr(199).chr(149) => 'U', chr(199).chr(150) => 'u',
// acute accent
chr(199).chr(151) => 'U', chr(199).chr(152) => 'u',
// caron
chr(199).chr(141) => 'A', chr(199).chr(142) => 'a',
chr(199).chr(143) => 'I', chr(199).chr(144) => 'i',
chr(199).chr(145) => 'O', chr(199).chr(146) => 'o',
chr(199).chr(147) => 'U', chr(199).chr(148) => 'u',
chr(199).chr(153) => 'U', chr(199).chr(154) => 'u',
// grave accent
chr(199).chr(155) => 'U', chr(199).chr(156) => 'u',
);
$string = strtr($string, $chars);
} else {
$chars = array();
// Assume ISO-8859-1 if not UTF-8
$chars['in'] = chr(128).chr(131).chr(138).chr(142).chr(154).chr(158)
.chr(159).chr(162).chr(165).chr(181).chr(192).chr(193).chr(194)
.chr(195).chr(196).chr(197).chr(199).chr(200).chr(201).chr(202)
.chr(203).chr(204).chr(205).chr(206).chr(207).chr(209).chr(210)
.chr(211).chr(212).chr(213).chr(214).chr(216).chr(217).chr(218)
.chr(219).chr(220).chr(221).chr(224).chr(225).chr(226).chr(227)
.chr(228).chr(229).chr(231).chr(232).chr(233).chr(234).chr(235)
.chr(236).chr(237).chr(238).chr(239).chr(241).chr(242).chr(243)
.chr(244).chr(245).chr(246).chr(248).chr(249).chr(250).chr(251)
.chr(252).chr(253).chr(255);
$chars['out'] = "EfSZszYcYuAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy";
$string = strtr($string, $chars['in'], $chars['out']);
$double_chars = array();
$double_chars['in'] = array(chr(140), chr(156), chr(198), chr(208), chr(222), chr(223), chr(230), chr(240), chr(254));
$double_chars['out'] = array('OE', 'oe', 'AE', 'DH', 'TH', 'ss', 'ae', 'dh', 'th');
$string = str_replace($double_chars['in'], $double_chars['out'], $string);
}
return $string;
}
My answer is an update of @dynamic solution since Romanian or perhaps other language diacritics weren't converted. I wrote the minimum functions and works like a charm.
print_r(remove_accents('Iași, Iași County, Romania'));
Upvotes: 0
Reputation:
$unwanted_array = array( '&' => 'and', '&' => 'and', '@' => 'at', '©' => 'c', '®' => 'r',
'̊'=>'','̧'=>'','̨'=>'','̄'=>'','̱'=>'',
'Á'=>'a','á'=>'a','À'=>'a','à'=>'a','Ă'=>'a','ă'=>'a','ắ'=>'a','Ắ'=>'A','Ằ'=>'A',
'ằ'=>'a','ẵ'=>'a','Ẵ'=>'A','ẳ'=>'a','Ẳ'=>'A','Â'=>'a','â'=>'a','ấ'=>'a','Ấ'=>'A',
'ầ'=>'a','Ầ'=>'a','ẩ'=>'a','Ẩ'=>'A','Ǎ'=>'a','ǎ'=>'a','Å'=>'a','å'=>'a','Ǻ'=>'a',
'ǻ'=>'a','Ä'=>'a','ä'=>'a','ã'=>'a','Ã'=>'A','Ą'=>'a','ą'=>'a','Ā'=>'a','ā'=>'a',
'ả'=>'a','Ả'=>'a','Ạ'=>'A','ạ'=>'a','ặ'=>'a','Ặ'=>'A','ậ'=>'a','Ậ'=>'A','Æ'=>'ae',
'æ'=>'ae','Ǽ'=>'ae','ǽ'=>'ae','ẫ'=>'a','Ẫ'=>'A',
'Ć'=>'c','ć'=>'c','Ĉ'=>'c','ĉ'=>'c','Č'=>'c','č'=>'c','Ċ'=>'c','ċ'=>'c','Ç'=>'c','ç'=>'c',
'Ď'=>'d','ď'=>'d','Ḑ'=>'D','ḑ'=>'d','Đ'=>'d','đ'=>'d','Ḍ'=>'D','ḍ'=>'d','Ḏ'=>'D','ḏ'=>'d','ð'=>'d','Ð'=>'D',
'É'=>'e','é'=>'e','È'=>'e','è'=>'e','Ĕ'=>'e','ĕ'=>'e','ê'=>'e','ế'=>'e','Ế'=>'E','ề'=>'e',
'Ề'=>'E','Ě'=>'e','ě'=>'e','Ë'=>'e','ë'=>'e','Ė'=>'e','ė'=>'e','Ę'=>'e','ę'=>'e','Ē'=>'e',
'ē'=>'e','ệ'=>'e','Ệ'=>'E','Ə'=>'e','ə'=>'e','ẽ'=>'e','Ẽ'=>'E','ễ'=>'e',
'Ễ'=>'E','ể'=>'e','Ể'=>'E','ẻ'=>'e','Ẻ'=>'E','ẹ'=>'e','Ẹ'=>'E',
'ƒ'=>'f',
'Ğ'=>'g','ğ'=>'g','Ĝ'=>'g','ĝ'=>'g','Ǧ'=>'G','ǧ'=>'g','Ġ'=>'g','ġ'=>'g','Ģ'=>'g','ģ'=>'g',
'H̲'=>'H','h̲'=>'h','Ĥ'=>'h','ĥ'=>'h','Ȟ'=>'H','ȟ'=>'h','Ḩ'=>'H','ḩ'=>'h','Ħ'=>'h','ħ'=>'h','Ḥ'=>'H','ḥ'=>'h',
'Ỉ'=>'I','Í'=>'i','í'=>'i','Ì'=>'i','ì'=>'i','Ĭ'=>'i','ĭ'=>'i','Î'=>'i','î'=>'i','Ǐ'=>'i','ǐ'=>'i',
'Ï'=>'i','ï'=>'i','Ḯ'=>'I','ḯ'=>'i','Ĩ'=>'i','ĩ'=>'i','İ'=>'i','Į'=>'i','į'=>'i','Ī'=>'i','ī'=>'i',
'ỉ'=>'I','Ị'=>'I','ị'=>'i','IJ'=>'ij','ij'=>'ij','ı'=>'i',
'Ĵ'=>'j','ĵ'=>'j',
'Ķ'=>'k','ķ'=>'k','Ḵ'=>'K','ḵ'=>'k',
'Ĺ'=>'l','ĺ'=>'l','Ľ'=>'l','ľ'=>'l','Ļ'=>'l','ļ'=>'l','Ł'=>'l','ł'=>'l','Ŀ'=>'l','ŀ'=>'l',
'Ń'=>'n','ń'=>'n','Ň'=>'n','ň'=>'n','Ñ'=>'N','ñ'=>'n','Ņ'=>'n','ņ'=>'n','Ṇ'=>'N','ṇ'=>'n','Ŋ'=>'n','ŋ'=>'n',
'Ó'=>'o','ó'=>'o','Ò'=>'o','ò'=>'o','Ŏ'=>'o','ŏ'=>'o','Ô'=>'o','ô'=>'o','ố'=>'o','Ố'=>'O','ồ'=>'o',
'Ồ'=>'O','ổ'=>'o','Ổ'=>'O','Ǒ'=>'o','ǒ'=>'o','Ö'=>'o','ö'=>'o','Ő'=>'o','ő'=>'o','Õ'=>'o','õ'=>'o',
'Ø'=>'o','ø'=>'o','Ǿ'=>'o','ǿ'=>'o','Ǫ'=>'O','ǫ'=>'o','Ǭ'=>'O','ǭ'=>'o','Ō'=>'o','ō'=>'o','ỏ'=>'o',
'Ỏ'=>'O','Ơ'=>'o','ơ'=>'o','ớ'=>'o','Ớ'=>'O','ờ'=>'o','Ờ'=>'O','ở'=>'o','Ở'=>'O','ợ'=>'o','Ợ'=>'O',
'ọ'=>'o','Ọ'=>'O','ọ'=>'o','Ọ'=>'O','ộ'=>'o','Ộ'=>'O','ỗ'=>'o','Ỗ'=>'O','ỡ'=>'o','Ỡ'=>'O',
'Œ'=>'oe','œ'=>'oe',
'ĸ'=>'k',
'Ŕ'=>'r','ŕ'=>'r','Ř'=>'r','ř'=>'r','ṙ'=>'r','Ŗ'=>'r','ŗ'=>'r','Ṛ'=>'R','ṛ'=>'r','Ṟ'=>'R','ṟ'=>'r',
'S̲'=>'S','s̲'=>'s','Ś'=>'s','ś'=>'s','Ŝ'=>'s','ŝ'=>'s','Š'=>'s','š'=>'s','Ş'=>'s','ş'=>'s',
'Ṣ'=>'S','ṣ'=>'s','Ș'=>'S','ș'=>'s',
'ſ'=>'z','ß'=>'ss','Ť'=>'t','ť'=>'t','Ţ'=>'t','ţ'=>'t','Ṭ'=>'T','ṭ'=>'t','Ț'=>'T',
'ț'=>'t','Ṯ'=>'T','ṯ'=>'t','™'=>'tm','Ŧ'=>'t','ŧ'=>'t',
'Ú'=>'u','ú'=>'u','Ù'=>'u','ù'=>'u','Ŭ'=>'u','ŭ'=>'u','Û'=>'u','û'=>'u','Ǔ'=>'u','ǔ'=>'u','Ů'=>'u','ů'=>'u',
'Ü'=>'u','ü'=>'u','Ǘ'=>'u','ǘ'=>'u','Ǜ'=>'u','ǜ'=>'u','Ǚ'=>'u','ǚ'=>'u','Ǖ'=>'u','ǖ'=>'u','Ű'=>'u','ű'=>'u',
'Ũ'=>'u','ũ'=>'u','Ų'=>'u','ų'=>'u','Ū'=>'u','ū'=>'u','Ư'=>'u','ư'=>'u','ứ'=>'u','Ứ'=>'U','ừ'=>'u','Ừ'=>'U',
'ử'=>'u','Ử'=>'U','ự'=>'u','Ự'=>'U','ụ'=>'u','Ụ'=>'U','ủ'=>'u','Ủ'=>'U','ữ'=>'u','Ữ'=>'U',
'Ŵ'=>'w','ŵ'=>'w',
'Ý'=>'y','ý'=>'y','ỳ'=>'y','Ỳ'=>'Y','Ŷ'=>'y','ŷ'=>'y','ÿ'=>'y','Ÿ'=>'y','ỹ'=>'y','Ỹ'=>'Y','ỷ'=>'y','Ỷ'=>'Y',
'Z̲'=>'Z','z̲'=>'z','Ź'=>'z','ź'=>'z','Ž'=>'z','ž'=>'z','Ż'=>'z','ż'=>'z','Ẕ'=>'Z','ẕ'=>'z',
'þ'=>'p','ʼn'=>'n','А'=>'a','а'=>'a','Б'=>'b','б'=>'b','В'=>'v','в'=>'v','Г'=>'g','г'=>'g','Ґ'=>'g','ґ'=>'g',
'Д'=>'d','д'=>'d','Е'=>'e','е'=>'e','Ё'=>'jo','ё'=>'jo','Є'=>'e','є'=>'e','Ж'=>'zh','ж'=>'zh','З'=>'z','з'=>'z',
'И'=>'i','и'=>'i','І'=>'i','і'=>'i','Ї'=>'i','ї'=>'i','Й'=>'j','й'=>'j','К'=>'k','к'=>'k','Л'=>'l','л'=>'l',
'М'=>'m','м'=>'m','Н'=>'n','н'=>'n','О'=>'o','о'=>'o','П'=>'p','п'=>'p','Р'=>'r','р'=>'r','С'=>'s','с'=>'s',
'Т'=>'t','т'=>'t','У'=>'u','у'=>'u','Ф'=>'f','ф'=>'f','Х'=>'h','х'=>'h','Ц'=>'c','ц'=>'c','Ч'=>'ch','ч'=>'ch',
'Ш'=>'sh','ш'=>'sh','Щ'=>'sch','щ'=>'sch','Ъ'=>'-',
'ъ'=>'-','Ы'=>'y','ы'=>'y','Ь'=>'-','ь'=>'-',
'Э'=>'je','э'=>'je','Ю'=>'ju','ю'=>'ju','Я'=>'ja','я'=>'ja','א'=>'a','ב'=>'b','ג'=>'g','ד'=>'d','ה'=>'h','ו'=>'v',
'ז'=>'z','ח'=>'h','ט'=>'t','י'=>'i','ך'=>'k','כ'=>'k','ל'=>'l','ם'=>'m','מ'=>'m','ן'=>'n','נ'=>'n','ס'=>'s','ע'=>'e',
'ף'=>'p','פ'=>'p','ץ'=>'C','צ'=>'c','ק'=>'q','ר'=>'r','ש'=>'w','ת'=>'t'
);
$accentsRemoved = strtr( $stringToRemoveAccents , $unwanted_array );
Upvotes: -1
Reputation: 1509
You can use an array key => value style to use with strtr() safely for UTF-8 characters even if they are multi-bytes.
function no_accent($str){
$accents = array('À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A', 'Å' => 'A', 'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a', 'Ā' => 'A', 'ā' => 'a', 'Ă' => 'A', 'ă' => 'a', 'Ą' => 'A', 'ą' => 'a', 'Ç' => 'C', 'ç' => 'c', 'Ć' => 'C', 'ć' => 'c', 'Ĉ' => 'C', 'ĉ' => 'c', 'Ċ' => 'C', 'ċ' => 'c', 'Č' => 'C', 'č' => 'c', 'Ð' => 'D', 'ð' => 'd', 'Ď' => 'D', 'ď' => 'd', 'Đ' => 'D', 'đ' => 'd', 'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E', 'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e', 'Ē' => 'E', 'ē' => 'e', 'Ĕ' => 'E', 'ĕ' => 'e', 'Ė' => 'E', 'ė' => 'e', 'Ę' => 'E', 'ę' => 'e', 'Ě' => 'E', 'ě' => 'e', 'Ĝ' => 'G', 'ĝ' => 'g', 'Ğ' => 'G', 'ğ' => 'g', 'Ġ' => 'G', 'ġ' => 'g', 'Ģ' => 'G', 'ģ' => 'g', 'Ĥ' => 'H', 'ĥ' => 'h', 'Ħ' => 'H', 'ħ' => 'h', 'Ì' => 'I', 'Í' => 'I', 'Î' => 'I', 'Ï' => 'I', 'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i', 'Ĩ' => 'I', 'ĩ' => 'i', 'Ī' => 'I', 'ī' => 'i', 'Ĭ' => 'I', 'ĭ' => 'i', 'Į' => 'I', 'į' => 'i', 'İ' => 'I', 'ı' => 'i', 'Ĵ' => 'J', 'ĵ' => 'j', 'Ķ' => 'K', 'ķ' => 'k', 'ĸ' => 'k', 'Ĺ' => 'L', 'ĺ' => 'l', 'Ļ' => 'L', 'ļ' => 'l', 'Ľ' => 'L', 'ľ' => 'l', 'Ŀ' => 'L', 'ŀ' => 'l', 'Ł' => 'L', 'ł' => 'l', 'Ñ' => 'N', 'ñ' => 'n', 'Ń' => 'N', 'ń' => 'n', 'Ņ' => 'N', 'ņ' => 'n', 'Ň' => 'N', 'ň' => 'n', 'ʼn' => 'n', 'Ŋ' => 'N', 'ŋ' => 'n', 'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Õ' => 'O', 'Ö' => 'O', 'Ø' => 'O', 'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o', 'ø' => 'o', 'Ō' => 'O', 'ō' => 'o', 'Ŏ' => 'O', 'ŏ' => 'o', 'Ő' => 'O', 'ő' => 'o', 'Ŕ' => 'R', 'ŕ' => 'r', 'Ŗ' => 'R', 'ŗ' => 'r', 'Ř' => 'R', 'ř' => 'r', 'Ś' => 'S', 'ś' => 's', 'Ŝ' => 'S', 'ŝ' => 's', 'Ş' => 'S', 'ş' => 's', 'Š' => 'S', 'š' => 's', 'ſ' => 's', 'Ţ' => 'T', 'ţ' => 't', 'Ť' => 'T', 'ť' => 't', 'Ŧ' => 'T', 'ŧ' => 't', 'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'U', 'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u', 'Ũ' => 'U', 'ũ' => 'u', 'Ū' => 'U', 'ū' => 'u', 'Ŭ' => 'U', 'ŭ' => 'u', 'Ů' => 'U', 'ů' => 'u', 'Ű' => 'U', 'ű' => 'u', 'Ų' => 'U', 'ų' => 'u', 'Ŵ' => 'W', 'ŵ' => 'w', 'Ý' => 'Y', 'ý' => 'y', 'ÿ' => 'y', 'Ŷ' => 'Y', 'ŷ' => 'y', 'Ÿ' => 'Y', 'Ź' => 'Z', 'ź' => 'z', 'Ż' => 'Z', 'ż' => 'z', 'Ž' => 'Z', 'ž' => 'z');
return strtr($str, $accents);
}
Plus, you save decode/encode in UTF-8 part.
Upvotes: 0
Reputation:
Merged Cazuma Nii Cavalcanti's implementation with Junior Mayhé's char list, hoping to save some time for some of you.
function stripAccents($str) {
return strtr(utf8_decode($str), utf8_decode('ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïñòóôõöøùúûüýÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĹĺĻļĽľĿŀŁłŃńŅņŇňʼnŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƒƠơƯưǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǺǻǼǽǾǿ'), 'AAAAAAAECEEEEIIIIDNOOOOOOUUUUYsaaaaaaaeceeeeiiiinoooooouuuuyyAaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgHhHhIiIiIiIiIiIJijJjKkLlLlLlLlllNnNnNnnOoOoOoOEoeRrRrRrSsSsSsSsTtTtTtUuUuUuUuUuUuWwYyYZzZzZzsfOoUuAaIiOoUuUuUuUuUuAaAEaeOo');
}
Upvotes: 2
Reputation: 1854
This is a piece of code I found and use often:
function stripAccents($stripAccents){
return strtr($stripAccents,'àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ','aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY');
}
Upvotes: 50
Reputation: 1606
WordPress' implementation is definitly the safest for UTF8 strings. For Latin1 strings, a simple strtr does the job, but ensure you're saving your script in LATIN1 format, not UTF-8.
Upvotes: -1
Reputation: 16411
Indeed is a matter of taste. There are many flavors for converting such letters.
function replaceAccents($str)
{
$a = array('À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 'Ø', 'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'ÿ', 'Ā', 'ā', 'Ă', 'ă', 'Ą', 'ą', 'Ć', 'ć', 'Ĉ', 'ĉ', 'Ċ', 'ċ', 'Č', 'č', 'Ď', 'ď', 'Đ', 'đ', 'Ē', 'ē', 'Ĕ', 'ĕ', 'Ė', 'ė', 'Ę', 'ę', 'Ě', 'ě', 'Ĝ', 'ĝ', 'Ğ', 'ğ', 'Ġ', 'ġ', 'Ģ', 'ģ', 'Ĥ', 'ĥ', 'Ħ', 'ħ', 'Ĩ', 'ĩ', 'Ī', 'ī', 'Ĭ', 'ĭ', 'Į', 'į', 'İ', 'ı', 'IJ', 'ij', 'Ĵ', 'ĵ', 'Ķ', 'ķ', 'Ĺ', 'ĺ', 'Ļ', 'ļ', 'Ľ', 'ľ', 'Ŀ', 'ŀ', 'Ł', 'ł', 'Ń', 'ń', 'Ņ', 'ņ', 'Ň', 'ň', 'ʼn', 'Ō', 'ō', 'Ŏ', 'ŏ', 'Ő', 'ő', 'Œ', 'œ', 'Ŕ', 'ŕ', 'Ŗ', 'ŗ', 'Ř', 'ř', 'Ś', 'ś', 'Ŝ', 'ŝ', 'Ş', 'ş', 'Š', 'š', 'Ţ', 'ţ', 'Ť', 'ť', 'Ŧ', 'ŧ', 'Ũ', 'ũ', 'Ū', 'ū', 'Ŭ', 'ŭ', 'Ů', 'ů', 'Ű', 'ű', 'Ų', 'ų', 'Ŵ', 'ŵ', 'Ŷ', 'ŷ', 'Ÿ', 'Ź', 'ź', 'Ż', 'ż', 'Ž', 'ž', 'ſ', 'ƒ', 'Ơ', 'ơ', 'Ư', 'ư', 'Ǎ', 'ǎ', 'Ǐ', 'ǐ', 'Ǒ', 'ǒ', 'Ǔ', 'ǔ', 'Ǖ', 'ǖ', 'Ǘ', 'ǘ', 'Ǚ', 'ǚ', 'Ǜ', 'ǜ', 'Ǻ', 'ǻ', 'Ǽ', 'ǽ', 'Ǿ', 'ǿ');
$b = array('A', 'A', 'A', 'A', 'A', 'A', 'AE', 'C', 'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I', 'D', 'N', 'O', 'O', 'O', 'O', 'O', 'O', 'U', 'U', 'U', 'U', 'Y', 's', 'a', 'a', 'a', 'a', 'a', 'a', 'ae', 'c', 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i', 'n', 'o', 'o', 'o', 'o', 'o', 'o', 'u', 'u', 'u', 'u', 'y', 'y', 'A', 'a', 'A', 'a', 'A', 'a', 'C', 'c', 'C', 'c', 'C', 'c', 'C', 'c', 'D', 'd', 'D', 'd', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'G', 'g', 'G', 'g', 'G', 'g', 'G', 'g', 'H', 'h', 'H', 'h', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'IJ', 'ij', 'J', 'j', 'K', 'k', 'L', 'l', 'L', 'l', 'L', 'l', 'L', 'l', 'l', 'l', 'N', 'n', 'N', 'n', 'N', 'n', 'n', 'O', 'o', 'O', 'o', 'O', 'o', 'OE', 'oe', 'R', 'r', 'R', 'r', 'R', 'r', 'S', 's', 'S', 's', 'S', 's', 'S', 's', 'T', 't', 'T', 't', 'T', 't', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'W', 'w', 'Y', 'y', 'Y', 'Z', 'z', 'Z', 'z', 'Z', 'z', 's', 'f', 'O', 'o', 'U', 'u', 'A', 'a', 'I', 'i', 'O', 'o', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'A', 'a', 'AE', 'ae', 'O', 'o');
return str_replace($a, $b, $str);
}
Upvotes: 12
Reputation: 1788
UTF-8 friendly version of the simple function posted above by Gino:
function stripAccents($str) {
return strtr(utf8_decode($str), utf8_decode('àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ'), 'aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY');
}
Had to come to this because my php document was UTF-8 encoded.
Hope it helps.
Upvotes: 63
Reputation: 28656
I can't reproduce your problem. I get the expected result.
How exactly are you using mb_detect_encoding()
to verify your string is in fact UTF-8?
If I simply call mb_detect_encoding($input)
on both a UTF-8 and ISO-8859-1 encoded version of your string, both of them return "UTF-8", so that function isn't particularly reliable.
iconv()
gives me a PHP "notice" when it gets the wrongly encoded string and only echoes "F", but that might just be because of different PHP/iconv settings/versions (?).
I suggest to you try calling mb_check_encoding($input, "utf-8")
first to verify that your string really is UTF-8. I think it probably isn't.
Upvotes: 2
Reputation:
One of the tricks I stumbled upon on the web was using htmlentities then stripping the encoded character :
$stripped = preg_replace('`&[^;]+;`','',htmlentities($string));
Not perfect but it does work well in some case.
But, you're writing about creating an URL string, so urlencode and its counterpart urldecode may be better. Or, if you are creating a query string, use this last function : http_build_query.
Upvotes: -1
Reputation: 39138
I agree with georgebrock's comment.
If you find a way to get //TRANSLIT to work, you can build friendly URLs:
$url = preg_replace( '/(\w)[^\w\s](\w)/', '$1$2', $url );
$url = preg_replace( '/[^a-z0-9]+/', '-', $url );
$url = preg_replace( '-'
, e.g. '/(?:(^|\-)\-+|\-$)/', '', $url );
If you can't get it to work, replace setp 1 with strtr/character-based replacement, like Xetius' solution.
Upvotes: 2
Reputation: 46794
You could use urlencode. Does not quite do what you want (remove accents), but will give you a url usable string
$output = urlencode ($input);
In Perl I could use a translate regex, but I cannot think of the PHP equivalent
$input =~ tr/áâàå/aaaa/;
etc...
you could do this using preg_replace
$patterns[0] = '/[á|â|à|å|ä]/';
$patterns[1] = '/[ð|é|ê|è|ë]/';
$patterns[2] = '/[í|î|ì|ï]/';
$patterns[3] = '/[ó|ô|ò|ø|õ|ö]/';
$patterns[4] = '/[ú|û|ù|ü]/';
$patterns[5] = '/æ/';
$patterns[6] = '/ç/';
$patterns[7] = '/ß/';
$replacements[0] = 'a';
$replacements[1] = 'e';
$replacements[2] = 'i';
$replacements[3] = 'o';
$replacements[4] = 'u';
$replacements[5] = 'ae';
$replacements[6] = 'c';
$replacements[7] = 'ss';
$output = preg_replace($patterns, $replacements, $input);
(Please note this was typed from a foggy beer ridden Friday after noon memory, so may not be 100% correct)
or you could make a hash table and do a replacement based off of that.
Upvotes: 6