Reputation: 57
I have a problem with sorting in Python. Can anybody help me, please? Thanks a lot! I want to sort a list the same way Excel sorts it. Original list:
table = [
u"女言葉の消失", # 2
u"キセキ", # 3
u"ふしぎなくすり", # 4
u"カキクケコ", # 5
u"嘘憑きとサルヴァドール", # 1
u"愛と勇気の三度笠ポン太", # 0
u"きせき",
"漢字",
"a",
"A",
"b",
"1",
"B"
]
Result of sorting in Python:
sorted(table)
['1', 'A', 'B', 'a', 'b', 'きせき', 'ふしぎなくすり', 'キセキ', '嘘憑きとサルヴァドール', '女言葉の消失', '愛と勇気の三度笠ポン太', '漢字', 'カキクケコ']
Sorting in Excel:
1
a
A
B
b
カキクケコ
きせき
キセキ
ふしぎなくすり
嘘憑きとサルヴァドール
女言葉の消失
愛と勇気の三度笠ポン太
漢字
Upvotes: 1
Views: 810
Reputation: 1447
An old question, but I will suggest a solution for future reference. I'd be inclined to use a PyICU collator for sorting. For Japanese, ICU's default collation is based on JIS X 4061. ICU also provides an alternative collation based on radical and stroke order.
table = [
"女言葉の消失",
"キセキ",
"ふしぎなくすり",
"カキクケコ",
"嘘憑きとサルヴァドール",
"愛と勇気の三度笠ポン太",
"きせき",
"漢字",
"a",
"A",
"b",
"1",
"B"
]
import icu
loc1 = icu.Locale("ja")
collator1 = icu.Collator.createInstance(loc1)
sorted_table1 = sorted(table, key=collator1.getSortKey)
This will sort the list as:
[
'1',
'a',
'A',
'b',
'B',
'カキクケコ',
'キセキ',
'きせき',
'ふしぎなくすり',
'愛と勇気の三度笠ポン太',
'嘘憑きとサルヴァドール',
'漢字',
'女言葉の消失'
]
The second approach is:
loc2 = icu.Locale.forLanguageTag("ja-u-co-unihan")
collator2 = icu.Collator.createInstance(loc2)
sorted_table2 = sorted(table, key=collator2.getSortKey)
This gives the sorted list:
[
'1',
'a',
'A',
'b',
'B',
'カキクケコ',
'キセキ',
'きせき',
'ふしぎなくすり',
'嘘憑きとサルヴァドール',
'女言葉の消失',
'愛と勇気の三度笠ポン太',
'漢字'
]
The question's desired output is:
[
"1",
"a",
"A",
"B",
"b",
"カキクケコ",
"きせき",
"キセキ",
"ふしぎなくすり",
"嘘憑きとサルヴァドール",
"女言葉の消失",
"愛と勇気の三度笠ポン太",
"漢字"
]
The key difference between the ICU sort and the Excel sort provided in the question is the relative ordering of "きせき" and "キセキ". At the collator's default strength, ICU gives them the same weight, i.e. they have identical sort keys, although the raw collation elements have minor differences:
For "キセキ":
Sort key: 5E 14 22 14 , 07 , 07 .
Raw collation elements: [7A14,05,u05,q1][7A22,05,u05,q1][7A14,05,u05,q1]
For "きせき":
Sort key: 5E 14 22 14 , 07 , 07 .
Raw collation elements: [7A14,05,u05][7A22,05,u05][7A14,05,u05]
If we check the strength of the collator:
collator2.getStrength()
# 2
So the collator is set to a tertiary level of comparison (strength). For Latin script collation, this is equivalent to case-level distinctions in the sort. Japanese sorting, however, uses a quaternary level in the collation table to distinguish between hiragana and katakana and to obtain a sort compliant with JIS X 4061.
collator2.setStrength(icu.Collator.QUATERNARY)
sorted_table3 = sorted(table, key=collator2.getSortKey)
This gives:
[
'1',
'a',
'A',
'b',
'B',
'カキクケコ',
'きせき',
'キセキ',
'ふしぎなくすり',
'嘘憑きとサルヴァドール',
'女言葉の消失',
'愛と勇気の三度笠ポン太',
'漢字'
]
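To make the idea of comparison levels concrete, here is a toy sketch in plain Python. It is not ICU and not a real collation: the helper names (`to_hiragana`, `primary_key`, `quaternary_key`) are made up for illustration, and it assumes only full-width katakana in the range U+30A1 to U+30F6. It shows why "キセキ" and "きせき" tie when the script difference is ignored, and how an extra, lower-priority level breaks the tie.

```python
KATAKANA_START = "\u30a1"  # ァ
KATAKANA_END = "\u30f6"    # ヶ


def is_katakana(ch):
    """True for full-width katakana letters that have a hiragana counterpart."""
    return KATAKANA_START <= ch <= KATAKANA_END


def to_hiragana(text):
    """Fold katakana to hiragana; the two blocks are offset by 0x60 code points."""
    return "".join(chr(ord(ch) - 0x60) if is_katakana(ch) else ch for ch in text)


def primary_key(text):
    """Toy key that ignores the hiragana/katakana distinction entirely."""
    return to_hiragana(text)


def quaternary_key(text):
    """Toy key: compare the folded text first, then hiragana (0) before katakana (1)."""
    script_weights = tuple(1 if is_katakana(ch) else 0 for ch in text)
    return (to_hiragana(text), script_weights)


words = ["キセキ", "きせき"]

# Both keys are equal, so the stable sort keeps the input order.
by_primary = sorted(words, key=primary_key)        # ['キセキ', 'きせき']

# The extra level breaks the tie and puts hiragana first.
by_quaternary = sorted(words, key=quaternary_key)  # ['きせき', 'キセキ']
```

A real collator does much more than this (kanji ordering, case, length marks and so on), so for actual data the ICU collator above is the better tool.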
Upvotes: 0
Reputation: 475
Python sorts the strings by the Unicode code points of their characters (the order in which the characters appear in Unicode), which works "ok" in most cases, except for Japanese kanji. I don't know Japanese, but many of the symbols in your sample set appear to be kanji rather than hiragana or katakana.
>>> table = [
... u"女言葉の消失", # 2
... u"キセキ", # 3
... u"ふしぎなくすり", # 4
... u"カキクケコ", # 5
... u"嘘憑きとサルヴァドール", # 1
... u"愛と勇気の三度笠ポン太", # 0
... u"きせき",
... "漢字",
... "a",
... "A",
... "b",
... "1",
... "B"
... ]
>>> for t in sorted(table):
... print([ord(c) for c in t])
...
[49]
[65]
[66]
[97]
[98]
[12365, 12379, 12365]
[12405, 12375, 12366, 12394, 12367, 12377, 12426]
[12461, 12475, 12461]
[22040, 24977, 12365, 12392, 12469, 12523, 12532, 12449, 12489, 12540, 12523]
[22899, 35328, 33865, 12398, 28040, 22833]
[24859, 12392, 21191, 27671, 12398, 19977, 24230, 31520, 12509, 12531, 22826]
[28450, 23383]
[65398, 65399, 65400, 65401, 65402]
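The last row of that output, 65398 to 65402, falls in the half-width katakana range (U+FF76 to U+FF7A), which is why that string sorts after all the kanji. As a small illustration only (it is not a full collation and does not fix the kanji order), NFKC normalization from the standard `unicodedata` module folds half-width katakana to the ordinary full-width forms before comparison. The short `words` list here is made up for the demonstration.

```python
import unicodedata

# Half-width katakana "ka ki ku ke ko", code points 65398..65402.
halfwidth = "\uff76\uff77\uff78\uff79\uff7a"

# NFKC replaces compatibility characters with their standard equivalents.
fullwidth = unicodedata.normalize("NFKC", halfwidth)  # 'カキクケコ'


def nfkc_key(text):
    """Sort key: compare strings by their NFKC-normalized form."""
    return unicodedata.normalize("NFKC", text)


words = ["漢字", halfwidth, "きせき"]

# Plain code-point order: the half-width string ends up last.
by_code_point = sorted(words)

# With the NFKC key, the katakana string sorts between the hiragana and the kanji.
by_nfkc = sorted(words, key=nfkc_key)

print([ord(c) for c in fullwidth])  # [12459, 12461, 12463, 12465, 12467]
```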
There is an interesting article here that explains the difficulties of getting sorting correct.
Upvotes: 0