Reputation: 5810
I am not familiar with Postgresql. Trying to learn it because I am moving my Rails apps to Heroku.
Here's an example with the ordering problem.
# select name_kr from users order by name_kr;
name_kr
---------
곽철
김영
박영
안준
양민
이남
임유
정신
차욱
강동수
강상구
강신용
강용석
강지영
강지원
강호석
You may not understand Korean. But one weird thing is that it displays 2 syllable words first and 3 syllables - each corretly ordered in its group.
Here's the related info:
kwanak_development=# show lc_collate;
lc_collate
-------------
en_US.UTF-8
(1 row)
kwanak_development=# show lc_ctype;
lc_ctype
-------------
en_US.UTF-8
(1 row)
What did I do wrong?
Thanks.
Sam
Additional Info:
I tried collation for order by and got an interesting result.
select name_kr from users order by name_kr collate "ko_KR"; => Same as above
select name_kr from users order by name_kr collate "C"; => Correct Result
Upvotes: 2
Views: 1920
Reputation: 47173
Using GNU sort
5.93 on OS X, i get the same ordering in the default locale (which is probably one of en_GB.utf8 or en_US.utf8 - something which doesn't know Korean, anyway). However if i set LC_ALL
to ko_KR.utf8, i get the three-character strings sorted first. The sets of two- and three- character strings keep the same order between themselves.
Note that all the three-character names begin with '강'. What this looks like is that '강' sorts after all the other initial characters in a naive locale, but sorts before it in Korean. If i insert a nonsense string made of one of the three-character strings with the initial character replaced with the initial character of one of the two-character strings (that is, "양호석"), then that sorts in with the two-character strings. This shows that the sort order is nothing to do with the length of the strings, and simply to do with the sorting of '강'.
I have absolutely no idea why '강' sorts after the other characters in my locale. '강' is at code point U+AC15. '곽' is at code point U+ACFD. '차' is at code point U+CC28. If the sort was on raw code point, '강' would sort before the other characters, as it does with the Korean sort.
If i sort these strings with Java, they come out with the '강' strings first, like the Korean sort. Java is pretty careful about unicode matters. The fact that it and the Korean sort agree leads me to think that that is the correct order.
If you encode the characters in UTF-8, then its first byte is 0xea, which again would sort before the other characters, which encode to bytes starting with values from 0xea to 0xec. This is presumably why collate "C"
gives you the right result - that setting causes the strings to be sorted as strings of opaque bytes, not encoded characters.
I am completely baffled as to why collate "ko_KR"
gives the wrong result.
Upvotes: 1
Reputation: 26454
PostgreSQL collation is mostly handled by PostgreSQL and should follow the same rules as the UNIX sort
command. The first thing to do is to try using the sort
command to determine if this is in fact the problem or if it is merely a symptom of something further down the stack.
If sort
does not show this problem with the same locale settings, then please file a bug with the PostgreSQL team (this strikes me as very unlikely but it is possible). If it does show the problem, then you will need to take it up with the makers of the standard C libraries you are using.
As a final note for those of us unfamiliar with the ordering of Korean, you may want to try to describe the desired ordering rather than just the problem ordering.
Upvotes: 2