Reputation: 63
I have a unicode string:
s = "ᠤᠷᠢᠳᠤ ᠲᠠᠯᠠ ᠶᠢᠨ ᠬᠠᠪᠲᠠᠭᠠᠢ ᠬᠡᠪᠲᠡᠭᠡ"
The result of the split method is slightly changed: a \u180e appears
in the second word.
>>> print(s.split())
['ᠤᠷᠢᠳᠤ', 'ᠲᠠᠯ\u180eᠠ', 'ᠶᠢᠨ', 'ᠬᠠᠪᠲᠠᠭᠠᠢ', 'ᠬᠡᠪᠲᠡᠭᠡ']
What I want to get is:
['ᠤᠷᠢᠳᠤ', 'ᠲᠠᠯᠠ ᠶᠢᠨ', 'ᠶᠢᠨ', 'ᠬᠠᠪᠲᠠᠭᠠᠢ', 'ᠬᠡᠪᠲᠡᠭᠡ']
What causes this, and how can I solve it?
Upvotes: 4
Views: 219
Reputation: 26
I don't think the problem is with the split function, but with how the list itself is displayed.
>>> s = ["ᠤᠷᠢᠳᠤ ᠲᠠᠯᠠ ᠶᠢᠨ ᠬᠠᠪᠲᠠᠭᠠᠢ ᠬᠡᠪᠲᠡᠭᠡ"]
>>> print(s)
['ᠤᠷᠢᠳᠤ ᠲᠠᠯ\u180eᠠ ᠶᠢᠨ ᠬᠠᠪᠲᠠᠭᠠᠢ ᠬᠡᠪᠲᠡᠭᠡ']
You should still be able to use the list normally, because the string itself is unchanged: the \u180e escape only appears when the whole list is printed, not when an element is used on its own.
>>> s = "ᠤᠷᠢᠳᠤ ᠲᠠᠯᠠ ᠶᠢᠨ ᠬᠠᠪᠲᠠᠭᠠᠢ ᠬᠡᠪᠲᠡᠭᠡ"
>>> s = s.split()
>>> for e in s: print(e)
...
ᠤᠷᠢᠳᠤ
ᠲᠠᠯᠠ
ᠶᠢᠨ
ᠬᠠᠪᠲᠠᠭᠠᠢ
ᠬᠡᠪᠲᠡᠭᠡ
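To make the difference between the two displays explicit, here is a minimal sketch. The variable names are mine, and the word is spelled with escapes (assumed to match the second word in the question) so the invisible U+180E is visible in the source. It assumes Python 3, where repr() escapes characters that str.isprintable() reports as non-printable.

```python
# TA, A, LA, U+180E (Mongolian vowel separator), A.
# Assumed spelling of the second word, written with escapes so that
# the invisible character is explicit in the source.
word = "\u1832\u1820\u182f\u180e\u1820"

items = [word]

# Printing a list shows repr() of each element, and repr() escapes
# non-printable characters such as U+180E.
print(items)

# Printing the element itself writes the raw characters, with no escape.
print(items[0])

# The stored string is the same either way; only the display differs.
print(len(word))
print("\u180e" in items[0])
print("\\u180e" in repr(items))
```

So the escape is a property of how a list is shown, not a change made by split().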
Upvotes: 1
Reputation: 544
According to Wikipedia: https://en.wikipedia.org/wiki/Whitespace_character#Unicode
U+180E was a space character until Unicode 6.3.0, so if Python implements an earlier Unicode spec then I guess split() would break on it, as it does on all space characters. You could work around this by giving split an argument if you only want to split on certain characters (s.split(" ")); that would give you:
>>> s.split(" ")
['ᠤᠷᠢᠳᠤ', 'ᠲᠠᠯ\u180eᠠ\u202fᠶᠢᠨ', 'ᠬᠠᠪᠲᠠᠭᠠᠢ', 'ᠬᠡᠪᠲᠡᠭᠡ']
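If you want to check what your own interpreter does, something like the following sketch may help. The sample string is hypothetical (Latin letters stand in for the Mongolian ones), and the reported category for U+180E depends on the Unicode version bundled with your Python build, so I have left that output out.

```python
import unicodedata

# Hypothetical sample: U+180E inside a word, U+202F (narrow no-break space)
# between two words, and an ordinary space before the last word.
sample = "a\u180eb\u202fc d"

for ch in ("\u180e", "\u202f", " "):
    # category() and isspace() reflect the Unicode data of this Python build.
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isspace())

# Unicode version this interpreter was built with.
print(unicodedata.unidata_version)

# With no argument, split() breaks on every character that isspace() accepts,
# which includes U+202F.
print(sample.split())

# With an explicit separator, only that exact character is a break point.
parts = sample.split(" ")
print(parts)
```

Passing " " keeps the word joined by U+202F together, which seems to be what the question is after.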
Upvotes: 1