mpcabd
mpcabd

Reputation: 1807

Postgres UTF8 ordering

I have this query in Postgres where I'm ordering a small amount of rows according to a varchar field. There seems to be an error in ordering UTF8 strings in Postgres:

For example:

'W' in UTF-8 is 87, while 'g' is 103, but running SELECT 'W' < 'g'; will return false while running SELECT convert_to('W', 'SQL_ASCII') < convert_to('g', 'SQL_ASCII')'; will return true.

The collation is en_US.UTF-8.

Is there a good explanation for this behavior? And how to avoid it?

Upvotes: 1

Views: 1460

Answers (2)

Clodoaldo Neto
Clodoaldo Neto

Reputation: 125284

This will show the ascii collation ordering of some of the first unicode code points, if you are using the utf8 encoding:

select s, chr(s) from generate_series(32, 255) s order by chr(s) collate "C";

Now the same for the pt_BR (brazilian portuguese) collation:

select s, chr(s) from generate_series(32, 255) s order by chr(s) collate "pt_BR";

What you call collation (en_US.UTF-8) is the collation before the dot and the encoding after the dot.

Upvotes: 1

Daniel
Daniel

Reputation: 28084

The ordering is not on the Unicode code points, but defined by the collation. And in UTF-8 we have 'A'<'a'<'B'<'b' etc.

Most people (except coders) expect this ordering. But feel free to collate with ASCII where you need it.

Upvotes: 1

Related Questions