Reputation: 41
How to read unicode files in python 2.x
(not UTF-8, unknown encoding)
I have been trying to find a way to read Unicode files. I searched the Internet for a long time but could not find one. All I found were ways to read files in a specific encoding such as UTF-8. I know that when I need to read UTF-8, I can use codecs:
codecs.open('unicode2.txt',encoding='utf-8')
Using this I can read UTF-8 files. But I want to know how to read Unicode files. Many posts titled 'the way to read unicode files in python' actually describe how to read files encoded as UTF-8 or UTF-16.
Why has no one explained how to read 'UNICODE' files?
This is an example of the hex values of a text file I am trying to read with Python.
It is Korean: "파이썬에서 한글 읽기"
(FF FE) 0C D3 74 C7 6C C3 D0 C5 1C C1 20 00 5C D5 00 AE 20 00 7D C7 30 AE
(FF FE) is the byte order mark.
Every 2 bytes represent one character. As you can see, a space is written as '20 00', not '20'.
In Unicode, a space is written as '20 00', but in UTF-8 a space is written as '20'.
There is no way to use codecs like codecs.open('unicode2.txt',encoding='unicode').
Is there really no way to read "unicode" files in Python?
Upvotes: 2
Views: 7124
Reputation: 414139
A disk file is a sequence of bytes that you can interpret as text only if you apply a character encoding such as utf-8 or utf-16le. "unicode" is not a character encoding.
There Ain't No Such Thing As Plain Text.
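To make that concrete, here is a small sketch showing that the same characters become different bytes under different encodings, which is why a space shows up as '20 00' in your file but as '20' in a UTF-8 file. The helper name to_hex is made up for this illustration; the calls themselves are standard library and should behave the same on Python 2.6+ and 3.

```python
import binascii


def to_hex(data):
    """Return the bytes in `data` as a lowercase hex string (hypothetical helper)."""
    return binascii.hexlify(data).decode("ascii")


space = u" "
pa = u"\ud30c"  # the first character of the sample text, U+D30C

# One character, several byte representations depending on the encoding.
space_utf8 = to_hex(space.encode("utf-8"))        # single byte
space_utf16le = to_hex(space.encode("utf-16-le"))  # two bytes, low byte first
pa_utf8 = to_hex(pa.encode("utf-8"))
pa_utf16le = to_hex(pa.encode("utf-16-le"))
pa_utf16be = to_hex(pa.encode("utf-16-be"))
```

The '20 00' and '0C D3' byte pairs from the hex dump match the utf-16-le results here, not a generic "unicode" format.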
Your example file might use the utf-16le encoding:
>>> text = u"파이썬"
>>> text.encode('utf-16le')
'\x0c\xd3t\xc7l\xc3'
>>> text.encode('utf-16le').encode('hex')
'0cd374c76cc3'
b'\xff\xfe' == codecs.BOM_UTF16_LE is the BOM for the UTF-16 (LE) character encoding. To read such a file, you could use the utf-16 encoding (BE or LE is chosen based on the BOM):
import codecs
# 'utf-16' reads the BOM to pick the byte order and strips it from the result
with codecs.open('filename', encoding='utf-16') as file:
    text = file.read()
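As a rough end-to-end check, the sketch below writes the exact bytes from the question's hex dump to a temporary file and reads them back with the utf-16 codec. It uses io.open rather than codecs.open; that is my substitution (io.open accepts the same encoding argument and is available on Python 2.6+ and 3), and the temporary file stands in for your 'unicode2.txt'.

```python
import binascii
import codecs
import io
import os
import tempfile

# Bytes from the question's hex dump, including the FF FE byte order mark.
hex_dump = "FF FE 0C D3 74 C7 6C C3 D0 C5 1C C1 20 00 5C D5 00 AE 20 00 7D C7 30 AE"
raw = binascii.unhexlify(hex_dump.replace(" ", ""))

# The file starts with the UTF-16 little-endian BOM.
has_le_bom = raw.startswith(codecs.BOM_UTF16_LE)

# Write the bytes to a throwaway file (stand-in for the real file).
fd, path = tempfile.mkstemp(suffix=".txt")
try:
    with os.fdopen(fd, "wb") as f:
        f.write(raw)
    # 'utf-16' uses the BOM to choose the byte order and drops it from the text.
    with io.open(path, encoding="utf-16") as f:
        text = f.read()
finally:
    os.remove(path)

# For comparison: 'utf-16-le' assumes the byte order and keeps the BOM
# as a leading u'\ufeff' character.
text_le = raw.decode("utf-16-le")
```

Here text holds the 11 characters of the Korean sentence, while text_le has an extra u'\ufeff' at the front, which is why plain 'utf-16' is usually the more convenient choice for files that start with a BOM.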
Upvotes: 5