user3051951

Reputation: 41

How to read unicode files in python (not UTF-8)

How to read unicode files in python 2.x (not UTF-8, unknown encoding)

I have been trying to find a way to read unicode files. I searched the Internet for a long time without success. Everything I found explains how to read files in a specific encoding such as UTF-8. I know that when I need to read UTF-8, I can use codecs:

codecs.open('unicode2.txt',encoding='utf-8')

With this I can read UTF-8 files, but I want to know how to read unicode files. Many posts titled 'the way to read unicode files in python' actually describe how to read files in encodings such as UTF-8 or UTF-16.

Why has nobody explained how to read 'UNICODE' files?

Here are the hex values of a text file I am trying to read with Python.

The text is Korean: "파이썬에서 한글 읽기"

(FF FE) 0C D3 74 C7 6C C3 D0 C5 1C C1 20 00 5C D5 00 AE 20 00 7D C7 30 AE

(FF FE) indicates the byte order, and every 2 bytes represent one character. As you can see, a space is written as '20 00', not '20'. In unicode a space is written as '20 00', but in UTF-8 it is written as '20'.

There seems to be no way to use codecs like "codecs.open('unicode2.txt',encoding='**unicode**')".

Is there really no way to read "unicode" files in python?

Upvotes: 2

Views: 7124

Answers (1)

jfs

Reputation: 414139

A file on disk is a sequence of bytes. You can interpret those bytes as text only if you know the character encoding, such as utf-8 or utf-16le. "unicode" is not a character encoding.

There Ain't No Such Thing As Plain Text.

Your example file appears to use the utf-16le encoding. Encoding the first three characters in Python 2 produces the same bytes as your hex dump:

>>> text = u"파이썬"
>>> text.encode('utf-16le')
'\x0c\xd3t\xc7l\xc3'
>>> text.encode('utf-16le').encode('hex')
'0cd374c76cc3'
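Going the other way, the full hex dump from the question can be decoded as a check. This is only a sketch: the names `hex_dump`, `raw` and `decoded` are made up for illustration, and it is written so that it should run on Python 3 as well as Python 2.6+.

```python
import binascii
import codecs

# Hex dump copied from the question, BOM included (FF FE = UTF-16 little-endian).
hex_dump = "FF FE 0C D3 74 C7 6C C3 D0 C5 1C C1 20 00 5C D5 00 AE 20 00 7D C7 30 AE"
raw = binascii.unhexlify(hex_dump.replace(" ", ""))

# The first two bytes are the UTF-16 LE byte order mark.
assert raw.startswith(codecs.BOM_UTF16_LE)

# The 'utf-16' codec reads the BOM, picks the byte order from it,
# and does not include the BOM in the decoded text.
decoded = raw.decode("utf-16")
print(decoded)
```

The '20 00' pairs in the dump are the space character U+0020 stored as one little-endian 16-bit code unit, which is how UTF-16 LE writes it.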

b'\xff\xfe' == codecs.BOM_UTF16_LE is the BOM (byte order mark) for the UTF-16 (LE) character encoding. To read such a file, you could use the utf-16 encoding, which chooses BE or LE based on the BOM:

import codecs

with codecs.open('filename', encoding='utf-16') as file:
    text = file.read()
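If the encoding really is unknown, one partial approach is to look at the BOM before choosing a codec. The sketch below is an assumption-laden illustration, not a general detector: `encoding_from_bom` and `read_text` are hypothetical helper names, the fallback to utf-8 is an arbitrary choice, and a file without a BOM cannot be identified this way at all.

```python
import codecs
import io

# Check longer BOMs first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so order matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def encoding_from_bom(head, default="utf-8"):
    """Hypothetical helper: map the first bytes of a file to a codec name.

    Returns `default` (an assumption, not a detected value) if no BOM is found.
    Note that FF FE 00 00 is ambiguous: it could also be UTF-16 LE text that
    starts with a NUL character; this sketch treats it as UTF-32.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default


def read_text(path):
    """Hypothetical helper: read a file using the codec suggested by its BOM."""
    with io.open(path, "rb") as f:
        head = f.read(4)
    # 'utf-16', 'utf-32' and 'utf-8-sig' all consume the BOM while decoding.
    with io.open(path, encoding=encoding_from_bom(head)) as f:
        return f.read()
```

For the file in the question, `encoding_from_bom` would see FF FE followed by 0C D3 and return 'utf-16'. `io.open` is used here because it exists in both Python 2.6+ and Python 3; `codecs.open` as shown above works as well.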

Upvotes: 5
