Postgres ordering of UTF-8 characters

Question

I'm building a small app that includes Esperanto words in my database, so I have words like ĉapelojn and brakhorloĝo, with "special" characters.

Using PostgreSQL 9.4.4 I have a words table with the following schema:

lingvoj_dev=# \d words
                                      Table "public.words"
   Column    |            Type             |                     Modifiers
-------------+-----------------------------+----------------------------------------------------
 id          | integer                     | not null default nextval('words_id_seq'::regclass)
 translated  | character varying(255)      |
 meaning     | character varying(255)      |
 times_seen  | integer                     |
 inserted_at | timestamp without time zone | not null
 updated_at  | timestamp without time zone | not null
Indexes:
    "words_pkey" PRIMARY KEY, btree (id)

But the following query gives some strange output:

lingvoj_dev=# SELECT w."translated" FROM "words" AS w ORDER BY w."translated" desc limit 10; 
translated
------------
 ĉu
 ŝi
 ĝi
 ĉevaloj
 ĉapelojn
 ĉapeloj
 ĉambro
 vostojn
 volas
 viro
(10 rows)

The ordering is inconsistent - I'd be okay with all of the words starting with special characters being at the end, but all of the words starting with ĉ should be grouped together and they're not! Why do ŝi and ĝi come in between ĉu and ĉevaloj?

The server encoding is UTF8, and the collation is en_AU.UTF-8.

edit: It looks like it's sorting all of the special characters as equivalent - it's ordering correctly based on the second character in each word. How do I make PostgreSQL see that ĉ, ŝ and ĝ are not equivalent?

klin · Accepted Answer

"I'd be okay with all of the words starting with special characters being at the end..."

Use collate "C", which compares strings byte by byte, so ĉ, ŝ and ĝ are no longer treated as equivalent:

SELECT w."translated" 
FROM "words" AS w 
ORDER BY w."translated" collate "C" desc limit 10;
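To check the effect without touching the words table, the same ordering can be tried on a few literal values. This is only a sketch: the inline list and the column alias word are made up for illustration, and it assumes the server encoding is UTF8, as stated in the question.

```sql
-- Hypothetical inline sample; "word" is an illustrative alias, not a real column.
SELECT word
FROM (VALUES ('ĉu'), ('ŝi'), ('ĝi'), ('ĉevaloj'), ('vostojn')) AS t(word)
ORDER BY word COLLATE "C" DESC;
-- With UTF8 encoding, "C" sorts by byte value, so the expected order is:
-- ŝi, ĝi, ĉu, ĉevaloj, vostojn
```

The accented letters still sort after all plain ASCII letters, but each one now forms its own group.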

See also the related question: Different behaviour in “order by” clause: Oracle vs. PostgreSQL

This query can be awkward to issue through an ORM. One alternative is to recreate the database with the LC_COLLATE = C option, as the OP suggested in a comment. Another is to change the collation of a single column:

ALTER TABLE "words" ALTER COLUMN "translated" TYPE text COLLATE "C";
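
For the database-level option, a minimal sketch might look like the following. The database name lingvoj_dev_c is hypothetical, and LC_CTYPE is assumed to match the en_AU.UTF-8 locale mentioned in the question. The existing data would still have to be dumped and restored into the new database.

```sql
-- Hypothetical database name; adjust to your setup.
-- TEMPLATE template0 is needed when the locale settings differ
-- from those of the default template database.
CREATE DATABASE lingvoj_dev_c
    ENCODING 'UTF8'
    LC_COLLATE 'C'
    LC_CTYPE 'en_AU.UTF-8'
    TEMPLATE template0;
```

With this in place, a plain ORDER BY on any text column uses byte ordering by default, so no collate clause is needed in the queries the ORM generates.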
