philshem

Reputation: 25341

Writing unicode with python - what is wrong with this character

With Python 2.7, I am reading text as unicode and writing it as utf-16-le. Most characters are written correctly, but some are not, for example u'\u810a', also known as unichr(33034). The following code does not write it correctly:

import codecs
with open('temp.txt','w') as temp:
    temp.write(codecs.BOM_UTF16_LE)     
    text = unichr(33034)  # text = u'\u810a'
    temp.write(text.encode('utf-16-le'))

However, either of the following changes to the code above makes it work:

  1. Writing unichr(33033) or unichr(33035) instead; both are written correctly.

  2. Using 'utf-8' encoding (without a BOM, i.e. byte-order mark).

How can I recognize characters that won't write correctly, and how can I write a 'utf-16-le' encoded file with BOM that either prints these characters or some replacement?

Upvotes: 1

Views: 750

Answers (3)

Mark Tolonen

Reputation: 177901

@Joni's answer identifies the root of the problem. As an alternative, codecs.open always opens the file in binary mode, even if that is not specified. The utf16 codec also writes the BOM automatically, using native endianness:

import codecs
with codecs.open('temp.txt','w','utf16') as temp:
    temp.write(u'\u810a')

Hex dump of temp.txt (on a little-endian machine):

FF FE 0A 81

Reference: codecs.open
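
As a rough check of the hex dump above, here is a minimal sketch that writes the character with the utf16 codec and reads the raw bytes back. The temporary path and the variable names are only for illustration, and the expected bytes depend on the machine's native byte order.

```python
import codecs
import os
import sys
import tempfile

# Hypothetical scratch location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'temp.txt')

# codecs.open opens the underlying file in binary mode, and the
# utf16 codec emits a BOM in native byte order on the first write.
with codecs.open(path, 'w', 'utf16') as temp:
    temp.write(u'\u810a')

# Read the raw bytes back to see what actually landed on disk.
with open(path, 'rb') as raw:
    data = raw.read()

hex_dump = ' '.join('%02X' % b for b in bytearray(data))
print(hex_dump)  # 'FF FE 0A 81' on little-endian, 'FE FF 81 0A' on big-endian
```

Decoding the bytes with the 'utf-16' codec should give back u'\u810a', since that decoder consumes the BOM.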

Upvotes: 1

Jordan

Reputation: 32542

You're already using the codecs library. When working with that file, replace open() with codecs.open() so that the encoding is handled transparently.

import codecs
with codecs.open('temp.txt', 'w', encoding='utf-16-le') as temp:
    temp.write(unichr(33033))
    temp.write(unichr(33034))
    temp.write(unichr(33035))

If you have a problem after that, you might have an issue with your viewer, not your Python script.
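
One caveat worth checking: the 'utf-16-le' codec, unlike 'utf16', does not write a BOM by itself. If the file needs both an explicit little-endian byte order and a BOM, one option is to write u'\ufeff' as the first character. This is a sketch under that assumption, with a made-up temporary path:

```python
import codecs
import os
import tempfile

# Hypothetical scratch location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'temp.txt')

with codecs.open(path, 'w', encoding='utf-16-le') as temp:
    # U+FEFF encodes to the bytes FF FE in utf-16-le, which is the BOM.
    temp.write(u'\ufeff')
    # Same three characters as unichr(33033), unichr(33034), unichr(33035).
    temp.write(u'\u8109\u810a\u810b')

# Read the raw bytes back to confirm the BOM and the payload.
with open(path, 'rb') as raw:
    data = raw.read()

print(' '.join('%02X' % b for b in bytearray(data)))
```

The file should then start with FF FE, followed by two bytes per character.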

Upvotes: 0

Joni

Reputation: 111349

You are opening the file in text mode, which means that line-break bytes will be translated to the local convention. Unfortunately, the UTF-16-LE encoding of the character you are trying to write includes a byte, 0A, that is interpreted as a line break, so it is not written to the file correctly.

Open the file in binary mode instead:

open('temp.txt','wb')
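
To illustrate, here is a minimal sketch of the binary-mode fix, together with one possible way to recognize affected characters: check whether the encoded bytes contain 0A. The helper name contains_newline_byte and the temporary path are made up for this example.

```python
import codecs
import os
import tempfile


def contains_newline_byte(ch, encoding='utf-16-le'):
    """Return True if the encoded form of ch contains the byte 0x0A."""
    return b'\n' in ch.encode(encoding)


text = u'\u810a'
encoded = text.encode('utf-16-le')  # two bytes: 0A 81

# Hypothetical scratch location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'temp.txt')

# 'wb' disables line-break translation, so the 0A byte is written as-is.
with open(path, 'wb') as temp:
    temp.write(codecs.BOM_UTF16_LE)
    temp.write(encoded)

with open(path, 'rb') as temp:
    data = temp.read()

print(contains_newline_byte(u'\u8109'))  # neighbour, encodes to 09 81
print(contains_newline_byte(u'\u810a'))  # the problem character
```

This would also explain why unichr(33033) and unichr(33035) work: their encodings are 09 81 and 0B 81, neither of which contains 0A.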

Upvotes: 4
