Reputation: 63
I have folders whose names include a few Korean characters.
When I read the list of folder names with os.listdir, the name values are completely different from a normal string.
Example:
"누" = (\xe1\x84\x82\xe1\x85\xae)
"누" in python console = (\xeb\x88\x84)
What makes the difference?
My guess is that os.listdir() returns the names in some other encoding.
Upvotes: 1
Views: 157
Reputation: 61654
Both of these are the same encoding (UTF-8), but...
"누" = (\xe1\x84\x82\xe1\x85\xae)
This represents the character as a sequence of two jamo (jamo are the 24 building blocks of the Korean (hangeul) alphabet):
>>> import unicodedata
>>> x = b'\xe1\x84\x82'.decode('utf-8')
>>> y = b'\xe1\x85\xae'.decode('utf-8')
>>> unicodedata.name(x)
'HANGUL CHOSEONG NIEUN'
>>> unicodedata.name(y)
'HANGUL JUNGSEONG U'
"누" in python console = (\xeb\x88\x84)
Whereas when you actually type the character in a console window, you (apparently) get a precomposed character:
>>> z = b'\xeb\x88\x84'.decode('utf-8')
>>> unicodedata.name(z)
'HANGUL SYLLABLE NU'
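If the goal is to compare a name from os.listdir with a string you typed, one option is to normalize both to the same Unicode form first. The sketch below uses the standard unicodedata.normalize with the NFC (composed) form. The helper name same_name is made up for illustration, and it assumes both values have already been decoded from UTF-8 to text:

```python
import unicodedata

# The two byte sequences from the question, decoded to text.
decomposed = b'\xe1\x84\x82\xe1\x85\xae'.decode('utf-8')   # two jamo, as seen from os.listdir
precomposed = b'\xeb\x88\x84'.decode('utf-8')              # one syllable, as typed in the console

def same_name(a, b):
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

# A plain comparison fails because the code point sequences differ.
print(decomposed == precomposed)
# After normalization the two spellings of the syllable compare equal.
print(same_name(decomposed, precomposed))
```

Normalizing to NFD instead would work equally well for comparison, as long as both sides use the same form.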
Upvotes: 2