Reputation: 83622
I currentyl have no clue on how to sort an array which contains UTF-8 encoded strings in PHP. The array comes from a LDAP server so sorting via a database (would be no problem) is no solution. The following does not work on my windows development machine (although I'd think that this should be at least a possible solution):
$array=array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
$oldLocal=setlocale(LC_COLLATE, "0");
var_dump(setlocale(LC_COLLATE, 'German_Germany.65001'));
usort($array, 'strcoll');
var_dump(setlocale(LC_COLLATE, $oldLocal));
var_dump($array);
The output is:
string(20) "German_Germany.65001"
string(1) "C"
array(6) {
[0]=>
string(6) "Birnen"
[1]=>
string(9) "Ungetiere"
[2]=>
string(6) "Äpfel"
[3]=>
string(5) "Apfel"
[4]=>
string(9) "Ungetüme"
[5]=>
string(11) "Österreich"
}
This is complete nonsense. Using 1252 as the codepage for setlocale()
gives another output but still a plainly wrong one:
string(19) "German_Germany.1252"
string(1) "C"
array(6) {
[0]=>
string(11) "Österreich"
[1]=>
string(6) "Äpfel"
[2]=>
string(5) "Apfel"
[3]=>
string(6) "Birnen"
[4]=>
string(9) "Ungetüme"
[5]=>
string(9) "Ungetiere"
}
Is there a way to sort an array with UTF-8 strings locale aware?
Just noted that this seems to be PHP on Windows problem, as the same snippet with de_DE.utf8
used as locale works on a Linux machine. Nevertheless a solution for this Windows-specific problem would be nice...
Upvotes: 27
Views: 34854
Reputation: 469
I am confronted with the same problem with German "Umlaute". After some research, this worked for me:
$laender =array("Österreich", "Schweiz", "England", "France", "Ägypten");
$laender = array_map("utf8_decode", $laender);
setlocale(LC_ALL,"de_DE@euro", "de_DE", "deu_deu");
sort($laender, SORT_LOCALE_STRING);
$laender = array_map("utf8_encode", $laender);
print_r($laender);
The result:
Array
(
[0] => Ägypten
[1] => England
[2] => France
[3] => Österreich
[4] => Schweiz
)
Upvotes: 0
Reputation: 5374
I found this following helper function to convert all letters of a string to ASCII letters very helpful here.
function _all_letters_to_ASCII($string) {
return strtr(utf8_decode($string),
utf8_decode('ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
}
After that a simple array_multisort()
gives you what you want.
$array = array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
$reference_array = $array;
foreach ($reference_array as $key => &$value) {
$value = _all_letters_to_ASCII($value);
}
var_dump($reference_array);
array_multisort($reference_array, $array);
var_dump($array);
Of course you can make the helper function fit more advanced needs. But for now, it looks pretty good.
array(6) {
[0]=> string(6) "Birnen"
[1]=> string(5) "Apfel"
[2]=> string(8) "Ungetume"
[3]=> string(5) "Apfel"
[4]=> string(9) "Ungetiere"
[5]=> string(10) "Osterreich"
}
array(6) {
[0]=> string(5) "Apfel"
[1]=> string(6) "Äpfel"
[2]=> string(6) "Birnen"
[3]=> string(11) "Österreich"
[4]=> string(9) "Ungetiere"
[5]=> string(9) "Ungetüme"
}
Upvotes: 1
Reputation: 2906
$a = array( 'Кръстев', 'Делян1', 'делян1', 'Делян2', 'делян3', 'кръстев' );
$col = new \Collator('bg_BG');
$col->asort( $a );
var_dump( $a );
Prints:
array
2 => string 'делян1' (length=11)
1 => string 'Делян1' (length=11)
3 => string 'Делян2' (length=11)
4 => string 'делян3' (length=11)
5 => string 'кръстев' (length=14)
0 => string 'Кръстев' (length=14)
The Collator
class is defined in PECL intl extension. It is distributed with PHP 5.3 sources but might be disabled for some builds. E.g. in Debian it is in package php5-intl .
Collator::compare
is useful for usort
.
Upvotes: 35
Reputation: 83622
Update on this issue:
Even though the discussion around this problem revealed that we could have discovered a PHP bug with strcoll()
and/or setlocale()
, this is clearly not the case. The problem is rather a limitation of the Windows CRT implementation of setlocale()
(PHPs setlocale()
is just a thin wrapper around the CRT call). The following is a citation of the MSDN page "setlocale, _wsetlocale":
The set of available languages, country/region codes, and code pages includes all those supported by the Win32 NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page like UTF-7 or UTF-8, setlocale will fail, returning NULL. The set of language and country/region codes supported by setlocale is listed in Language and Country/Region Strings.
It therefore is impossible to use locale-aware string operations within PHP on Windows when strings are multi-byte encoded.
Upvotes: 8
Reputation: 83622
Eventually this problem cannot be solved in a simple way without using recoded strings (UTF-8 → Windows-1252 or ISO-8859-1) as suggested by ΤΖΩΤΖΙΟΥ due to an obvious PHP bug as discovered by Huppie. To summarize the problem, I created the following code snippet which clearly demonstrates that the problem is the strcoll() function when using the 65001 Windows-UTF-8-codepage.
function traceStrColl($a, $b) {
$outValue=strcoll($a, $b);
echo "$a $b $outValue\r\n";
return $outValue;
}
$locale=(defined('PHP_OS') && stristr(PHP_OS, 'win')) ? 'German_Germany.65001' : 'de_DE.utf8';
$string="ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜabcdefghijklmnopqrstuvwxyzäöüß";
$array=array();
for ($i=0; $i<mb_strlen($string, 'UTF-8'); $i++) {
$array[]=mb_substr($string, $i, 1, 'UTF-8');
}
$oldLocale=setlocale(LC_COLLATE, "0");
var_dump(setlocale(LC_COLLATE, $locale));
usort($array, 'traceStrColl');
setlocale(LC_COLLATE, $oldLocale);
var_dump($array);
The result is:
string(20) "German_Germany.65001"
a B 2147483647
[...]
array(59) {
[0]=>
string(1) "c"
[1]=>
string(1) "B"
[2]=>
string(1) "s"
[3]=>
string(1) "C"
[4]=>
string(1) "k"
[5]=>
string(1) "D"
[6]=>
string(2) "ä"
[7]=>
string(1) "E"
[8]=>
string(1) "g"
[...]
The same snippet works on a Linux machine without any problems producing the following output:
string(10) "de_DE.utf8"
a B -1
[...]
array(59) {
[0]=>
string(1) "a"
[1]=>
string(1) "A"
[2]=>
string(2) "ä"
[3]=>
string(2) "Ä"
[4]=>
string(1) "b"
[5]=>
string(1) "B"
[6]=>
string(1) "c"
[7]=>
string(1) "C"
[...]
The snippet also works when using Windows-1252 (ISO-8859-1) encoded strings (of course the mb_* encodings and the locale must be changed then).
I filed a bug report on bugs.php.net: Bug #46165 strcoll() does not work with UTF-8 strings on Windows. If you experience the same problem, you can give your feedback to the PHP team on the bug-report page (two other, probably related, bugs have been classified as bogus - I don't think that this bug is bogus ;-).
Thanks to all of you.
Upvotes: 5
Reputation: 117477
Your collation needs to match the character set. Since your data is UTF-8 encoded, you should use a UTF-8 collation. It could be named differently on different platforms, but a good guess would be de_DE.utf8
.
On UNIX systems, you can get a list of currently installed locales with the command
locale -a
Upvotes: -1
Reputation: 11432
Using your example with codepage 1252 worked perfectly fine here on my windows development machine.
$array=array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
$oldLocal=setlocale(LC_COLLATE, "0");
var_dump(setlocale(LC_COLLATE, 'German_Germany.1252'));
usort($array, 'strcoll');
var_dump(setlocale(LC_COLLATE, $oldLocal));
var_dump($array);
...snip...
This was with PHP 5.2.6. btw.
function traceStrColl($a, $b) {
$outValue = strcoll($a, $b);
echo "$a $b $outValue\r\n";
return $outValue;
}
$array=array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
setlocale(LC_COLLATE, 'German_Germany.65001');
usort($array, 'traceStrColl');
print_r($array);
gives:
Ungetüme Äpfel 2147483647 Ungetüme Birnen 2147483647 Ungetüme Apfel 2147483647 Ungetüme Ungetiere 2147483647 Österreich Ungetüme 2147483647 Äpfel Ungetiere 2147483647 Äpfel Birnen 2147483647 Apfel Äpfel 2147483647 Ungetiere Birnen 2147483647
I did find some bug reports which have been flagged being bogus... The best bet you have is filing a bug-report I suppose though...
Upvotes: 0
Reputation: 95921
This is a very complex issue, since UTF-8 encoded data can contain any Unicode character (i.e. characters from many 8-bit encodings which collate differently in different locales).
Perhaps if you converted your UTF-8 data into Unicode (not familiar with PHP unicode functions, sorry) and then normalized them into NFD or NFKD and then sorting on code points might give some collation that would make sense to you (ie "A" before "Ä").
Check the links I provided.
EDIT: since you mention that your input data are clear (I assume they all fall in the "windows-1252" codepage), then you should do the following conversion: UTF-8 → Unicode → Windows-1252, on which Windows-1252 encoded data do a sort selecting the "CP1252" locale.
Upvotes: 4