liyuhao

Reputation: 375

When we write a string to a file, why do we need to care about the encoding?

I read the Python 2 Unicode HOWTO and Unicode In Python, Completely Demystified to understand Python's Unicode system, and I came across some code like this:

f = open('test.txt', 'w')
f.write(uni.encode('utf-8'))   # uni is a unicode string; it is encoded to UTF-8 bytes before writing
f.close()

Why does a unicode string need to be encoded before it is written to a file?

I know the default encoding is ASCII, so there will be an error because non-ASCII characters are out of its range.

But when I write it to a file, isn't it just copying the bits of uni from RAM to the file? Why does the program need to care about the encoding?
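
For example, something like this is the error I mean (a minimal sketch, assuming uni holds a non-ASCII character):

uni = u'caf\xe9'   # hypothetical example with a non-ASCII character

f = open('test.txt', 'w')
f.write(uni)       # raises UnicodeEncodeError under the implicit ASCII conversion
f.close()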

Upvotes: 0

Views: 65

Answers (1)

tdelaney

Reputation: 77357

Unicode characters are abstract entities called code points, and they have several encodings such as UTF-32, UTF-16 and UTF-8. It would take 4 bytes per code point to express all of the characters in a single fixed-width form (and even then, Unicode has combining, non-spacing characters, so one could argue that a user-visible character can be even bigger). To keep things confusing, many systems use "code pages" that existed before Unicode was standardized, which are different mappings between byte values and the characters they display.
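
As a rough sketch (Python 2, using an arbitrary example code point), here is the same abstract character coming out as different byte sequences under different encodings:

ch = u'\u20ac'                      # EURO SIGN, an arbitrary example code point

print repr(ch.encode('utf-8'))      # '\xe2\x82\xac'   -- 3 bytes
print repr(ch.encode('utf-16-be'))  # ' \xac'          -- 2 bytes
print repr(ch.encode('utf-32-be'))  # '\x00\x00 \xac'  -- 4 bytes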

Python 2's unicode strings are stored in RAM as UTF-16 code units (on a "narrow" build; "wide" builds use 4 bytes per code point). So right away we see an issue: if you want to write UTF-8, the in-memory representation isn't going to work as-is. Python needs to read the in-memory UTF-16 string and produce a UTF-8 byte string to write.
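
You can see the difference between the in-memory form and the bytes that end up in the file (a sketch that assumes a narrow Python 2 build, where unicode is stored as UTF-16 code units):

import sys

print sys.maxunicode             # 65535 on a narrow (UTF-16) build, 1114111 on a wide build

uni = u'\u4e2d\u6587'            # arbitrary example: two CJK characters
print len(uni)                   # 2 code units held in memory
print len(uni.encode('utf-8'))   # 6 bytes once re-encoded as UTF-8 for the file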

There is another subtle issue, which is that Intel-based processors are "little endian" while multi-byte Unicode encodings such as UTF-16 can be serialized either big endian or little endian (meaning the bytes within a code unit are ordered differently). So even if you want to write UTF-16, a byte-order decision has to be made. Because of this little/big problem, it is common to write a BOM (Byte Order Mark) at the front of the data so that decoders can detect the byte order.
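
For example (a sketch, run on a little-endian machine), the same code point serialized three ways:

ch = u'\u20ac'                      # arbitrary example code point

print repr(ch.encode('utf-16-le'))  # '\xac '          -- low byte first
print repr(ch.encode('utf-16-be'))  # ' \xac'          -- high byte first
print repr(ch.encode('utf-16'))     # '\xff\xfe\xac '  -- BOM, then native (little endian) order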

Characters can be expressed in many ways (encodings), so what should the default be? It's a historical thing, really. Since ASCII was the way it was done throughout history (well, Unix history at least), it's still the default in Python 2.
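
You can check this default directly in Python 2 (a sketch; the string value is what a stock CPython 2 build reports):

import sys

print sys.getdefaultencoding()   # 'ascii' -- used for implicit unicode/str conversions

uni = u'caf\xe9'                 # hypothetical non-ASCII string
str(uni)                         # raises UnicodeEncodeError because \xe9 is outside ASCII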

When writing non-binary data, we always have to pass through some sort of codec. That's the price we pay for the time it took multi-lingual computing to mature and for computing systems to become powerful enough to deal with it. No way would my Commodore 64 have been able to deal with Phoenician.
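
If you would rather not call .encode() yourself, one option is to let a wrapped file object run the codec for you; this is a sketch using the standard codecs module:

import codecs

uni = u'caf\xe9'   # hypothetical example unicode string

# codecs.open returns a file-like object that encodes on write,
# so you hand it unicode and UTF-8 bytes land on disk.
f = codecs.open('test.txt', 'w', encoding='utf-8')
f.write(uni)
f.close()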

Upvotes: 1
