bardosd
bardosd

Reputation: 413

Django, SQLite - Accurate ordering of strings with accented letters

Main problem:
I have a Python (3.4) Django (1.6) web app using an SQLite (3) database containing a table of authors. When I get the ordered list of authors some names with accented characters like ’Čapek’ and ’Örkény’ are the end of list instead of at (or directly after) section ’c’ and ’o’ of the list.

My 1st try:
SQLite can accept collation definitions. I searched for one that was made to order UTF-8 strings correctly for example Localized and Unicode collation in Android (Accented Search in sqlite (android)) but found none.

My 2nd try:
I found an old closed Django ticket about my problem: https://code.djangoproject.com/ticket/8384 It suggests sorting with Python as workaround. I found it quite unsatisfying. Firstly if I sort with a Python method (like below) instead of ordering at model level I cannot use generic views. Secondly ordering with a Python method returns the very same result as the SQLite order_by does: ’Čapek’ and ’Örkény’ are placed after section 'z'.

author_list = sorted(Author.objects.all(), key=lambda x: (x.lastname, x.firstname))

How could I get the queryset ordered correctly?

Upvotes: 3

Views: 1655

Answers (2)

bardosd
bardosd

Reputation: 413

Thanks to the link CL wrote in his comment, I managed to overcome the difficulties that I replied about. I answer my question to share the piece of code that worked because using Pyuca to sort querysets seems to be a rare and undocumented case.

# import section
from pyuca import Collator

# Calling Collator() takes some seconds so you should create it as reusable variable.
c = Collator()

# ...

# main part:
author_list = sorted(Author.objects.all(), key=lambda x: (c.sort_key(x.lastname), c.sort_key(x.firstname)))

The point is to use sort_key method with the attribute you want to sort by as argument. You can sort by multiple attributes as you see in the example.

Last words: In my language (Hungarian) we use four different accented version of the Latin letter ‘o’: ‘o’, ’ó’, ’ö’, ’ő’. ‘o’ and ‘ó’ are equal in sorting, and ‘ö’ and ‘ő’ are equal too, and ‘ö’/’ő’ are after ‘o’/’ó’. In the default collation table the four letters are equal. Now I try to find a way to define or find a localized collation table.

Upvotes: 2

loopbackbee
loopbackbee

Reputation: 23322

You could create a new field in the table, fill it with the result of unidecode, then sort according to it.

Using a property to provide get/set methods could help in keeping the fields in sync.

Upvotes: 0

Related Questions