Adam Piotrowski
Adam Piotrowski

Reputation: 3295

Strange sorting by text column on Ubuntu

It's the biggest surprise on the part of Postgres I've ever had to deal with. I made this simple test on ubuntu 14.04:

postgres=# select version();
                                                    version                                                    
---------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.4.0 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-55), 64-bit
(1 row)

postgres=# create table test (val text);
CREATE TABLE
postgres=# insert into test values ('a'), ('b'), ('-a'), ('-b');
INSERT 0 4
postgres=# select * from test order by val;
 val 
-----
 a
 -a
 b
 -b
(4 rows)

Obviously this is not what can be considered correct. A similar result can be seen on sqlfiddle that uses

PostgreSQL version 9.3.6 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit

The same test on Windows servers

PostgreSQL 9.4.0, compiled by Visual C++ build 1800, 64-bit
PostgreSQL 9.1.2, compiled by Visual C++ build 1500, 32-bit

gives the correct result:

postgres=# select version();
                           version
-------------------------------------------------------------
 PostgreSQL 9.4.0, compiled by Visual C++ build 1800, 64-bit
(1 row)

postgres=# create table test (val text);
CREATE TABLE
postgres=# insert into test values ('a'), ('b'), ('-a'), ('-b');
INSERT 0 4
postgres=# select * from test order by val;
 val
-----
 -a
 -b
 a
 b
(4 rows)

I do not know how to deal with it. I need a stable server on Ubuntu, but now lost confidence in this version. Ubuntu users, please make the test and share the results. Any tips would be appreciated of course.


Thanks to Simo I found out what follows. Postgres has only two built-in collations: C and POSIX. Any other collations are provided by operating system. Postgres 9.3 server on sqlfiddle uses lc_collate = en_US.utf8 and my server uses pl_PL.utf8, what means the collations are provided by Ubuntu. To check it out I ran sort on terminal and obtained the same results:

me@ubu:~$ sort
a
b
-a
-b

a
-a
b
-b

Now I know the culprit, but it still does not solve my problem.

Upvotes: 1

Views: 139

Answers (2)

Daniel Vérité
Daniel Vérité

Reputation: 61526

my server uses pl_PL.utf8, what means the collations are provided by Ubuntu

Yes, but this can be overriden in various ways.

  • An individual expression can be assigned a collation, which is particularly relevant in an ORDER BY clause. If you use:

    select * from test order by val COLLATE "C";
    

    This will sort as what is to you the "correct result", independently of lc_collate.

  • An individual column can also be assigned a collation so that it doesn't have to be explicitly mentioned in every sort or comparison:

     create table test (val text COLLATE "C");
    
  • A database can also be assigned a collation that differs from the server's default, see LC_COLLATE clause in CREATE DATABASE.

See Collation support in PostgreSQL manual for more.

Upvotes: 2

klin
klin

Reputation: 121604

Unfortunately you will probably have to accept this. In short, on GNU systems in UTF locales all non alphanumeric characters are ignored during sorting. You can find more detailed explanation in this answer.

Upvotes: 2

Related Questions