Reputation: 631
There are many situations where the user's language is not written in the Latin script (examples include Greek, Russian, and Chinese). In most of these cases sorting is done by
Or even more specific for the rest...:
is it possible to select the sort based on script?
Example1: Chinese script first then Latin-Greek-Arabic (or even more...)
Example2: Greek script first then Latin-Arabic-Chinese (or even more...)
What is the most effective and Pythonic way to create a sort like either of these? (By «any» I mean either the simple «selected script first» with the rest in Unicode order, or the more complicated «selected script first» followed by a specified order for the remaining scripts.)
Upvotes: 2
Views: 837
Reputation: 1462
An old question, but I would like to offer an alternative approach making use of PyICU and ICU4C.
ICU4C allows you to reorder sorting results based on script. There are a few ways to do this. ICU4C uses the CLDR Collation Algorithm, a tailoring of the Unicode Collation Algorithm. Since the question did not indicate what the core language was for collation, I will assume we are using the Root collation.
Unicode has two extensions to BCP47, which allow controlling or overriding various aspects of collation. Specifically, -u-kr- allows the reordering of scripts within the collation. You specify a hyphen-separated list of four-letter ISO 15924 script codes, and this controls the sort order of the scripts.
The code below uses the language tag und-u-kr-Adlm-Hans-Ethi-Latn. und is undefined; it doesn't represent a specific language, so the root collation is used. I could have used en-u-kr-Adlm-Hans-Ethi-Latn, since English also uses the root collation. The sort key is collator.getSortKey.
The language tag specifies sorting Adlam first, then Simplified Chinese, followed by Ethiopic script, then Latin. All other scripts are sorted in their order in the root collation.
import icu
locale = icu.Locale.forLanguageTag('und-u-kr-Adlm-Hans-Ethi-Latn')
collator = icu.Collator.createInstance(locale)
languages = ['𞠗𞢱𞡓𞠣', 'नेपाली','ꛀꛣꚧꚳ','白语','ማርኛ','français', '𞤊𞤵𞥅𞤼𞤢 𞤔𞤢𞤤𞤮𞥅']
print(sorted(languages))
# ['français', 'नेपाली', 'ማርኛ', '白语', 'ꛀꛣꚧꚳ', '𞠗𞢱𞡓𞠣', '𞤊𞤵𞥅𞤼𞤢 𞤔𞤢𞤤𞤮𞥅']
print(sorted(languages, key=collator.getSortKey))
# ['𞤊𞤵𞥅𞤼𞤢 𞤔𞤢𞤤𞤮𞥅', '白语', 'ማርኛ', 'français', 'नेपाली', 'ꛀꛣꚧꚳ', '𞠗𞢱𞡓𞠣']
Upvotes: 0
Reputation: 13596
Interesting question. Here’s some sample code that classifies strings according to the writing system of the first character.
import unicodedata
words = ["Japanese",        # English
         "Nihongo",         # Japanese, rōmaji
         "にほんご",         # Japanese, hiragana
         "ニホンゴ",         # Japanese, katakana
         "日本語",           # Japanese, kanji
         "Японский язык",   # Russian
         "जापानी भाषा"       # Hindi (Devanagari)
         ]

def wskey(s):
    """Return a sort key that is a tuple (n, s), where n is an int based
    on the writing system of the first character, and s is the passed
    string. Writing systems not addressed (Devanagari, in this example)
    go at the end."""
    sort_order = {
        # We leave gaps to make later insertions easy
        'CJK'      : 100,
        'HIRAGANA' : 200,
        'KATAKANA' : 200,   # hiragana and katakana at same level
        'CYRILLIC' : 300,
        'LATIN'    : 400
    }
    # The first word of the Unicode character name identifies the script
    name = unicodedata.name(s[0], "UNKNOWN")
    first = name.split()[0]
    n = sort_order.get(first, 999999)
    return (n, s)

words.sort(key=wskey)
for s in words:
    print(s)
In this example, I am sorting hiragana and katakana (the two Japanese syllabaries) at the same level, which means pure-katakana strings will always come after pure-hiragana strings. If we wanted to sort them such that the same syllable (e.g., に and ニ) sorted together, that would be trickier.
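One possible way to get the "same syllable sorts together" behaviour mentioned above is to fold katakana to hiragana before comparing. The sketch below is only an illustration, not a full Japanese collation: it relies on the fact that katakana U+30A1..U+30F6 sit exactly 0x60 above the corresponding hiragana, and the names KATAKANA_TO_HIRAGANA, SORT_ORDER and kana_key are hypothetical helpers introduced here. Characters outside that range, such as the prolonged sound mark ー, are left untouched.

```python
import unicodedata

# Katakana U+30A1..U+30F6 are exactly 0x60 above hiragana U+3041..U+3096,
# so a str.translate table can fold one onto the other.
KATAKANA_TO_HIRAGANA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}

# Script ranks, mirroring the values used in the example above (assumed).
SORT_ORDER = {"CJK": 100, "HIRAGANA": 200, "KATAKANA": 200,
              "CYRILLIC": 300, "LATIN": 400}


def kana_key(s):
    """Return (script rank, kana-folded string, original string)."""
    first = unicodedata.name(s[0], "UNKNOWN").split()[0] if s else "UNKNOWN"
    rank = SORT_ORDER.get(first, 999999)
    folded = s.translate(KATAKANA_TO_HIRAGANA)
    # The original string is the last tie-breaker, so a hiragana word
    # sorts just before its katakana twin instead of in arbitrary order.
    return (rank, folded, s)


kana_words = ["ニホンゴ", "にほんご", "カタカナ", "かたかな"]
print(sorted(kana_words, key=kana_key))
# ['かたかな', 'カタカナ', 'にほんご', 'ニホンゴ']
```

With this key, each katakana word lands next to its hiragana spelling, whereas wskey alone would put all hiragana strings before all katakana strings. For real-world Japanese sorting, a collation library such as ICU is likely the safer choice.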
Upvotes: 3