ilias iliadis

Reputation: 631

How to sort Latin after local language in Python 3?

There are many situations where the user's language is not written in a Latin script (examples include Greek, Russian and Chinese). In most of these cases a sorting is done by

Or even more specific for the rest...:

Is it possible to select the sort order based on script?

Example1: Chinese script first then Latin-Greek-Arabic (or even more...)

Example2: Greek script first then Latin-Arabic-Chinese (or even more...)

What is the most effective and Pythonic way to create a sort like any of these? (By «any» I mean either the simple «selected script first» with the rest as in a plain Unicode sort, or the more complicated «selected script first» followed by a specified order for the rest of the scripts.)

Upvotes: 2

Views: 837

Answers (2)

Andj

Reputation: 1462

An old question, but I would like to offer an alternative approach making use of PyICU and ICU4C.

ICU4C allows you to reorder sorting results based on script. There are a few ways to do this. ICU4C uses the CLDR Collation Algorithm, a tailoring of the Unicode Collation Algorithm. Since the question did not indicate what the core language was for collation, I will assume we are using the Root collation.

Unicode defines two extensions to BCP 47 (-u- and -t-); the -u- extension lets you control or override various aspects of collation. Specifically, -u-kr- reorders scripts within the collation. You specify a hyphen-separated list of four-letter ISO 15924 script codes, and the listed scripts are sorted first, in the order given.

  1. create a locale, using a BCP 47 language tag. In this case, I will use und-u-kr-Adlm-Hans-Ethi-Latn. und means "undetermined": it doesn't represent a specific language, so the root collation is used. I could have used en-u-kr-Adlm-Hans-Ethi-Latn since English also uses the root collation.
  2. create a collator using this locale.
  3. specify the key for sorting: collator.getSortKey.

The language tag specifies sorting Adlam, then Simplified Chinese, followed by Ethiopic script, then Latin. All other scripts are sorted in their order in the root collation.

import icu
locale = icu.Locale.forLanguageTag('und-u-kr-Adlm-Hans-Ethi-Latn')
collator = icu.Collator.createInstance(locale)
languages = ['𞠗𞢱𞡓𞠣', 'नेपाली','ꛀꛣꚧꚳ','白语','ማርኛ','français', '𞤊𞤵𞥅𞤼𞤢 𞤔𞤢𞤤𞤮𞥅']

print(sorted(languages))
# ['français', 'नेपाली', 'ማርኛ', '白语', 'ꛀꛣꚧꚳ', '𞠗𞢱𞡓𞠣', '𞤊𞤵𞥅𞤼𞤢 𞤔𞤢𞤤𞤮𞥅']

print(sorted(languages, key=collator.getSortKey))
# ['𞤊𞤵𞥅𞤼𞤢 𞤔𞤢𞤤𞤮𞥅', '白语', 'ማርኛ', 'français', 'नेपाली', 'ꛀꛣꚧꚳ', '𞠗𞢱𞡓𞠣']
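
The same approach can be applied to Example 2 from the question (Greek first, then Latin, Arabic and Chinese). The sketch below is only an illustration: reorder_tag is a hypothetical helper name, not part of PyICU, and the script codes Grek, Latn, Arab and Hani are my reading of what the question asks for. The PyICU part runs only if the library is installed.

```python
def reorder_tag(scripts, base="und"):
    """Build a BCP 47 tag asking the collator to sort `scripts` first, in order.

    Hypothetical helper: it only assembles the string, ICU does the real work.
    """
    if not scripts:
        raise ValueError("at least one script code is required")
    for code in scripts:
        # ISO 15924 script codes are four ASCII letters, e.g. 'Grek'
        if len(code) != 4 or not (code.isascii() and code.isalpha()):
            raise ValueError(f"not an ISO 15924 script code: {code!r}")
    return "-".join([base, "u", "kr", *scripts])


# Example 2 from the question: Greek, then Latin, Arabic, Han
tag = reorder_tag(["Grek", "Latn", "Arab", "Hani"])
print(tag)  # und-u-kr-Grek-Latn-Arab-Hani

try:
    import icu  # PyICU; optional for this sketch
except ImportError:
    icu = None

if icu is not None:
    collator = icu.Collator.createInstance(icu.Locale.forLanguageTag(tag))
    words = ["zeta", "γάμμα", "قلم", "中文", "alpha", "άλφα"]
    print(sorted(words, key=collator.getSortKey))
```

Scripts that are not listed in the tag keep their relative order from the root collation, as described above.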

Upvotes: 0

Tom Zych

Reputation: 13596

Interesting question. Here’s some sample code that classifies strings according to the writing system of the first character.

import unicodedata

words = ["Japanese",         # English
         "Nihongo",          # Japanese, rōmaji
         "にほんご",          # Japanese, hiragana
         "ニホンゴ",          # Japanese, katakana
         "日本語",            # Japanese, kanji
         "Японский язык",    # Russian
         "जापानी भाषा"        # Hindi (Devanagari)
]

def wskey(s):
    """Return a sort key that is a tuple (n, s), where n is an int based
    on the writing system of the first character, and s is the passed
    string. Writing systems not addressed (Devanagari, in this example)
    go at the end."""

    sort_order = {
        # We leave gaps to make later insertions easy
        'CJK' : 100,
        'HIRAGANA' : 200,
        'KATAKANA' : 200,  # hiragana and katakana at same level
        'CYRILLIC' : 300,
        'LATIN' : 400
    }

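    # Empty strings have no first character, so treat them as unknown
    name = unicodedata.name(s[0], "UNKNOWN") if s else "UNKNOWN"
    # The first word of the character name identifies the writing system
    first = name.split()[0]
    n = sort_order.get(first, 999999)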
    return (n, s)

words.sort(key=wskey)
for s in words:
    print(s)
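
For the simpler case in the question (one selected script first, everything else in plain code point order), the same idea reduces to a two-level key. This is a minimal sketch under the same assumption as the code above, namely that the first word of the Unicode character name is a good enough stand-in for the script; script_first_key is a hypothetical name.

```python
import unicodedata


def script_first_key(preferred):
    """Return a key function that sorts strings starting in `preferred` first.

    `preferred` is the first word of the Unicode character names for that
    script, e.g. 'GREEK', 'CYRILLIC', 'CJK' (an approximation of the script).
    """
    preferred = preferred.upper()

    def key(s):
        # Empty strings have no first character; treat them as unknown
        name = unicodedata.name(s[0], "UNKNOWN") if s else "UNKNOWN"
        rank = 0 if name.split()[0] == preferred else 1
        # Within each rank, fall back to ordinary code point order
        return (rank, s)

    return key


words = ["zeta", "γαμμα", "beta", "Яблоко", "αλφα"]
print(sorted(words, key=script_first_key("GREEK")))
# ['αλφα', 'γαμμα', 'beta', 'zeta', 'Яблоко']
```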

In the wskey example, I am sorting hiragana and katakana (the two Japanese syllabaries) at the same level. Ties within a level are broken by plain string comparison, and the katakana block follows the hiragana block in Unicode, which means pure-katakana strings will always come after pure-hiragana strings. If we wanted to sort them such that the same syllable (e.g., に and ニ) sorted together, that would be trickier.
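
One possible way to make matching syllables sort together is to fold katakana onto hiragana before comparing. The sketch below relies on the fact that the katakana letters U+30A1 to U+30F6 sit exactly 0x60 code points above their hiragana counterparts; fold_kana and kana_key are hypothetical names, and half-width katakana and the prolonged sound mark ー are deliberately not handled.

```python
KATAKANA_START, KATAKANA_END = 0x30A1, 0x30F6  # ァ .. ヶ
KANA_OFFSET = 0x60  # distance between a katakana letter and its hiragana twin


def fold_kana(s):
    """Replace each katakana letter in the range above with its hiragana twin."""
    return "".join(
        chr(ord(c) - KANA_OFFSET) if KATAKANA_START <= ord(c) <= KATAKANA_END else c
        for c in s
    )


def kana_key(s):
    # Compare on the folded form first; the original string breaks ties,
    # so hiragana still precedes katakana for the same syllable.
    return (fold_kana(s), s)


print(fold_kana("ニホンゴ"))  # にほんご
print(sorted(["ニ", "な", "に", "ナ"], key=kana_key))
# ['な', 'ナ', 'に', 'ニ']
```

To combine this with a script-level key, the folded string could take the place of s in the returned tuple, with the original string appended as a final tie-breaker.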

Upvotes: 3
