Reputation: 57

How to sort a list of strings combining Japanese and Latin characters in Python

I have a problem with sorting in Python. Can anybody help me, please? Thanks a lot! I want to sort a list the same way Excel sorts it. Original list:

table = [
    u"女言葉の消失",  # 2
    u"キセキ",  # 3
    u"ふしぎなくすり",  # 4
    u"ｶｷｸｹｺ",  # 5
    u"嘘憑きとサルヴァドール",  # 1
    u"愛と勇気の三度笠ポン太",  # 0
    u"きせき",
    "漢字",
    "a",
    "A",
    "b",
    "1",
    "B"
]

Result of sorting in Python:

sorted(table)

['1', 'A', 'B', 'a', 'b', 'きせき', 'ふしぎなくすり', 'キセキ', '嘘憑きとサルヴァドール', '女言葉の消失', '愛と勇気の三度笠ポン太', '漢字', 'ｶｷｸｹｺ']

Sorting in Excel:

1
a
A
B
b
ｶｷｸｹｺ
きせき
キセキ
ふしぎなくすり
嘘憑きとサルヴァドール
女言葉の消失
愛と勇気の三度笠ポン太
漢字

Upvotes: 1

Answers (2)

Andj

Reputation: 1447

An old question, but I will suggest a solution for future reference. I'd be inclined to use a PyICU collator for sorting. By default ICU provides a collation based on JIS X 4061. It also provides an alternative sort based on radical and stroke order.

table = [
    "女言葉の消失",
    "キセキ",
    "ふしぎなくすり",
    "ｶｷｸｹｺ",
    "嘘憑きとサルヴァドール",
    "愛と勇気の三度笠ポン太",
    "きせき",
    "漢字",
    "a",
    "A",
    "b",
    "1",
    "B"
]
import icu
loc1 = icu.Locale("ja")
collator1 = icu.Collator.createInstance(loc1)
sorted_table1 = sorted(table, key=collator1.getSortKey)

This will sort the list as:

[
    '1', 
    'a', 
    'A', 
    'b', 
    'B', 
    'ｶｷｸｹｺ', 
    'キセキ', 
    'きせき', 
    'ふしぎなくすり', 
    '愛と勇気の三度笠ポン太', 
    '嘘憑きとサルヴァドール', 
    '漢字', 
    '女言葉の消失'
]

The second approach is:

loc2 = icu.Locale.forLanguageTag("ja-u-co-unihan")
collator2 = icu.Collator.createInstance(loc2)
sorted_table2 = sorted(table, key=collator2.getSortKey)

This gives the sorted list:

[
    '1', 
    'a', 
    'A', 
    'b', 
    'B', 
    'ｶｷｸｹｺ', 
    'キセキ', 
    'きせき', 
    'ふしぎなくすり', 
    '嘘憑きとサルヴァドール', 
    '女言葉の消失', 
    '愛と勇気の三度笠ポン太', 
    '漢字'
]

The question's desired output is:

[
    "1",
    "a",
    "A",
    "B",
    "b",
    "ｶｷｸｹｺ", 
    "きせき",
    "キセキ", 
    "ふしぎなくすり", 
    "嘘憑きとサルヴァドール",  
    "女言葉の消失", 
    "愛と勇気の三度笠ポン太",
    "漢字"
]

The key difference between the ICU sort and the Excel sort provided in the question is the relative ordering of "きせき" and "キセキ". At the collator's default strength ICU gives them the same weight, i.e. they have identical sort keys, although the raw collation elements differ slightly:

For "キセキ":
Sort key: 5E 14 22 14 , 07 , 07 .
Raw collation elements: [7A14,05,u05,q1][7A22,05,u05,q1][7A14,05,u05,q1]

For "きせき":
Sort key: 5E 14 22 14 , 07 , 07 .
Raw collation elements: [7A14,05,u05][7A22,05,u05][7A14,05,u05]

If we check the strength of the collator:

collator2.getStrength()
# 2

So the collator is set to the tertiary level of comparison (strength 2, i.e. icu.Collator.TERTIARY). For Latin script collation this is the level that distinguishes case. Japanese sorting, however, uses a quaternary level in the collation table to distinguish between hiragana and katakana, so the strength has to be raised to obtain a sort compliant with JIS X 4061.

collator2.setStrength(icu.Collator.QUATERNARY)
sorted_table3 = sorted(table, key=collator2.getSortKey)

This gives:

[
    '1',
    'a',
    'A',
    'b',
    'B',
    'ｶｷｸｹｺ',
    'きせき',
    'キセキ',
    'ふしぎなくすり',
    '嘘憑きとサルヴァドール',
    '女言葉の消失',
    '愛と勇気の三度笠ポン太',
    '漢字'
]
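
If PyICU is not available, the idea behind strength levels can be imitated in plain Python. The sketch below is only a toy model, not ICU's algorithm and not Excel's: `fold_kana` and `toy_key` are hypothetical helper names, the secondary (accent) level is skipped, and kanji are ordered by code point rather than by reading. It should not be expected to match ICU in general.

```python
import unicodedata

table = [
    "女言葉の消失", "キセキ", "ふしぎなくすり", "ｶｷｸｹｺ",
    "嘘憑きとサルヴァドール", "愛と勇気の三度笠ポン太", "きせき",
    "漢字", "a", "A", "b", "1", "B",
]

def fold_kana(text):
    """Map half-width and full-width katakana onto hiragana."""
    # NFKC turns half-width katakana into full-width katakana.
    wide = unicodedata.normalize("NFKC", text)
    # Katakana U+30A1..U+30F6 sit exactly 0x60 above their hiragana twins.
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in wide
    )

def is_katakana(ch):
    # Full-width katakana block or half-width katakana forms.
    return "\u30a1" <= ch <= "\u30ff" or "\uff66" <= ch <= "\uff9f"

def toy_key(text, quaternary=False):
    """Toy multi-level sort key (an illustration, not ICU collation)."""
    primary = fold_kana(text).casefold()       # base letters only
    tertiary = [ch.isupper() for ch in text]   # lowercase before uppercase
    key = (primary, tertiary)
    if quaternary:
        # Extra level: hiragana (False) sorts before katakana (True).
        key += ([is_katakana(ch) for ch in text],)
    return key

# Without the extra level the two kana spellings compare equal,
# so the stable sort keeps their original relative order.
tertiary_order = sorted(table, key=toy_key)

# With the extra level, hiragana is placed before katakana.
quaternary_order = sorted(table, key=lambda s: toy_key(s, quaternary=True))

print(tertiary_order)
print(quaternary_order)
```

For this sample list the two results differ only in the relative position of "きせき" and "キセキ", which mirrors the effect of raising the collator strength. For real data a proper collator remains the safer choice.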

Upvotes: 0

Lucas Scott

Reputation: 475

Python sorts strings by the ordinal Unicode value of each character (its code point, i.e. the order in which characters appear in Unicode). This works "ok" for most cases, but not for Japanese kanji. I don't know Japanese, but many of the symbols in your sample set appear to be kanji and not hiragana or katakana.

>>> table = [
...     u"女言葉の消失",  # 2
...     u"キセキ",  # 3
...     u"ふしぎなくすり",  # 4
...     u"ｶｷｸｹｺ",  # 5
...     u"嘘憑きとサルヴァドール",  # 1
...     u"愛と勇気の三度笠ポン太",  # 0
...     u"きせき",
...     "漢字",
...     "a",
...     "A",
...     "b",
...     "1",
...     "B"
... ]
>>> for t in sorted(table):
...     print([ord(c) for c in t])
... 
[49]
[65]
[66]
[97]
[98]
[12365, 12379, 12365]
[12405, 12375, 12366, 12394, 12367, 12377, 12426]
[12461, 12475, 12461]
[22040, 24977, 12365, 12392, 12469, 12523, 12532, 12449, 12489, 12540, 12523]
[22899, 35328, 33865, 12398, 28040, 22833]
[24859, 12392, 21191, 27671, 12398, 19977, 24230, 31520, 12509, 12531, 22826]
[28450, 23383]
[65398, 65399, 65400, 65401, 65402]
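
To see which Unicode ranges those numbers fall into, a small sketch using only the standard `unicodedata` module can print the code point and character name for one character per script. The `samples` list is made up for illustration and is not part of the question.

```python
import unicodedata

# One representative character per group found in the question's list.
samples = ["ｶ", "漢", "キ", "き", "a", "A", "1"]

by_code_point = sorted(samples)
for ch in by_code_point:
    # Code point in hex plus the official Unicode character name.
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Default string sorting is the same as comparing lists of code points.
same_as_ord = sorted(samples, key=lambda s: [ord(c) for c in s])
print(by_code_point == same_as_ord)

# NFKC normalization maps half-width katakana to the full-width letters,
# which moves them from the end of the order into the katakana range.
normalized = unicodedata.normalize("NFKC", "ｶｷｸｹｺ")
print(normalized)
```

The order that comes out (digit, uppercase, lowercase, hiragana, katakana, kanji, half-width katakana) is the same grouping seen in the output of sorted(table). Normalizing only relocates the half-width forms; it does not make kanji sort by reading.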

There is an interesting article here that explains the difficulties of getting sorting correct.

Upvotes: 0
