alex.bour
alex.bour

Reputation: 2964

Column sorting in PostgreSQL is different between macOS and Ubuntu using same collation

I created a database with UTF8 encoding and fr_FR collation on both my Mac and Ubuntu server like this:

CREATE DATABASE my_database OWNER 'admin' TEMPLATE 'template0' ENCODING 'UTF8' LC_COLLATE 'fr_FR.UTF-8' LC_CTYPE 'fr_FR.UTF-8';

On both, I queried the collation:

show lc_collate;

and obtained:

fr_FR.UTF-8

Then, I tried to sort the same database and didn't obtain same results:

SELECT winery FROM usr_wines WHERE user_id=1 AND status=1 ORDER BY winery LIMIT 5;

1 - On macOS:

 a space before the a
A New record
Aa
Altesinoo
Aé

2- On Ubuntu 14.04:

Aa
Aé
Altesino
A New Wine
 a space before a

On Ubuntu, I have installed the desired locales and create a new collation:

CREATE COLLATION "fr_FR.utf8" (LOCALE = "fr_FR.utf8")

select * from pg_collation;

  collname  | collnamespace | collowner | collencoding | collcollate | collctype  
------------+---------------+-----------+--------------+-------------+------------
 default    |            11 |        10 |           -1 |             | 
 C          |            11 |        10 |           -1 | C           | C
 POSIX      |            11 |        10 |           -1 | POSIX       | POSIX
 C.UTF-8    |            11 |        10 |            6 | C.UTF-8     | C.UTF-8
 en_US      |            11 |        10 |            6 | en_US.utf8  | en_US.utf8
 en_US.utf8 |            11 |        10 |            6 | en_US.utf8  | en_US.utf8
 ucs_basic  |            11 |        10 |            6 | C           | C
 fr_FR      |          2200 |        10 |            6 | fr_FR.utf8  | fr_FR.utf8

On the mac, the fr_FR collation was already installed.

So why this difference in sorting ?

Another strange issue on Ubuntu: I fi tried to force the collation in my request:

SELECT winery FROM usr_wines WHERE user_id=1 AND status=1 ORDER BY winery COLLATE "fr_FR" LIMIT 5;

I got:

ERROR: collation "fr_FR" for encoding "UTF8" does not exist

Any help is welcome.

Upvotes: 3

Views: 2376

Answers (1)

Vao Tsun
Vao Tsun

Reputation: 51456

COLLATE "C" will give you predictable results on all platforms. Additional collations may be available depending on operating system support. And thus its behaviour totally depends on OS.

https://www.postgresql.org/docs/current/static/collation.html:

On all platforms, the collations named default, C, and POSIX are available. Additional collations may be available depending on operating system support. The default collation selects the LC_COLLATE and LC_CTYPE values specified at database creation time. The C and POSIX collations both specify "traditional C" behavior, in which only the ASCII letters "A" through "Z" are treated as letters, and sorting is done strictly by character code byte values.

If the operating system provides support for using multiple locales within a single program (newlocale and related functions), then when a database cluster is initialized, initdb populates the system catalog pg_collation with collations based on all the locales it finds on the operating system at the time. For example, the operating system might provide a locale named de_DE.utf8. initdb would then create a collation named de_DE.utf8 for encoding UTF8 that has both LC_COLLATE and LC_CTYPE set to de_DE.utf8. It will also create a collation with the .utf8 tag stripped off the name. So you could also use the collation under the name de_DE, which is less cumbersome to write and makes the name less encoding-dependent. Note that, nevertheless, the initial set of collation names is platform-dependent.

Upvotes: 1

Related Questions