Reputation: 1807
I have this query in Postgres where I'm ordering a small amount of rows according to a varchar field. There seems to be an error in ordering UTF8 strings in Postgres:
For example:
'W' in UTF-8 is 87, while 'g' is 103, but running SELECT 'W' < 'g';
will return false
while running SELECT convert_to('W', 'SQL_ASCII') < convert_to('g', 'SQL_ASCII')';
will return true
.
The collation is en_US.UTF-8
.
Is there a good explanation for this behavior? And how to avoid it?
Upvotes: 1
Views: 1460
Reputation: 125284
This will show the ascii collation ordering of some of the first unicode code points, if you are using the utf8 encoding:
select s, chr(s) from generate_series(32, 255) s order by chr(s) collate "C";
Now the same for the pt_BR (brazilian portuguese) collation:
select s, chr(s) from generate_series(32, 255) s order by chr(s) collate "pt_BR";
What you call collation (en_US.UTF-8
) is the collation before the dot and the encoding after the dot.
Upvotes: 1
Reputation: 28084
The ordering is not on the Unicode code points, but defined by the collation. And in UTF-8 we have 'A'<'a'<'B'<'b' etc.
Most people (except coders) expect this ordering. But feel free to collate with ASCII where you need it.
Upvotes: 1