Write bytes literal with undefined character to CSV file (Python 3)

Question

Using Python 3.4.2, I want to scrape part of a website. According to its meta tags, the site is encoded in iso-8859-1, and I want to write one part (along with other parts) to a CSV file.

However, this part contains an undefined character with the hex value 0x8b. To preserve the part as faithfully as possible, I want to write it to the CSV file as is, but Python won't let me.

Here's a minimal example:

import urllib.request
import urllib.parse
import csv

if __name__ == "__main__":
    with open("bytewrite.csv", "w", newline="") as csvfile:
        a = b'\x8b' # bytes as returned by urllib.request
        b = a.decode("iso-8859-1")

        w = csv.writer(csvfile)
        w.writerow([b])

And this is the output:

Traceback (most recent call last):
  File "D:\Eigene\Dateien\Code\Python\writebyte.py", line 12, in <module>
    w.writerow([b])
  File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x8b' in position 0: character maps to <undefined>

In the end, I did it manually: I copied and pasted the value with Notepad++, and according to a hex editor it was inserted correctly. But how can I do it with Python 3? Why does Python even care what 0x8b stands for, instead of just writing it to the file?

It also confuses me that, according to iso8859_1.py (and also cp1252.py) in C:\Python34\lib\encodings\, the lookup tables seem to handle 0x8B without any problem:

# iso8859_1.py
    '\x8b'     #  0x8B -> <control>
# cp1252.py
    '\u2039'   #  0x8B -> SINGLE LEFT-POINTING ANGLE QUOTATION MARK

Mark Tolonen · Accepted Answer

Quoted from csv docs:

Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:

import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

The same applies to writing in something other than the system default encoding: specify the encoding argument when opening the output file.

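What is happening: you decoded the byte to Unicode using iso-8859-1, which yields the code point U+008B (a C1 control character). The file, however, was opened with the default encoding, and getpreferredencoding() returns cp1252 on your system. In cp1252 the byte 0x8B stands for U+2039 instead, and U+008B has no mapping at all, so the character cannot be encoded.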
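To make the mismatch concrete, here is a small sketch that does not involve csv at all. The variable names are only for illustration. It shows the same byte round-tripping through iso-8859-1 while the decoded character is rejected by cp1252:

```python
# iso-8859-1 decodes the byte 0x8B to the code point U+008B.
ch = b'\x8b'.decode('iso-8859-1')

# iso-8859-1 maps every code point U+0000..U+00FF back to the same byte,
# so encoding with it restores the original byte.
latin1_bytes = ch.encode('iso-8859-1')

# cp1252 has no byte for U+008B, so encoding raises the same error
# as in the traceback from the question.
try:
    ch.encode('cp1252')
    cp1252_failed = False
except UnicodeEncodeError:
    cp1252_failed = True

# In cp1252 the byte 0x8B means something else entirely:
# U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK.
cp1252_char = b'\x8b'.decode('cp1252')

print(repr(ch), latin1_bytes, cp1252_failed, repr(cp1252_char))
```

So the lookup tables are not the problem. The problem is that decoding used one table and encoding (implicitly) used the other.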

Corrected minimal example:

import csv
with open('bytewrite.csv', 'w', encoding='iso-8859-1', newline='') as csvfile:
    a = b'\x8b'
    b = a.decode("iso-8859-1")
    w = csv.writer(csvfile)
    w.writerow([b])
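If you want to confirm the result without a hex editor, one possible check is to read the file back in binary mode. This is only a sketch: the temporary directory is an assumption made so the example does not overwrite anything, and the path is otherwise the same file name as above.

```python
import csv
import os
import tempfile

# Hypothetical location, used only so this check leaves existing files alone.
path = os.path.join(tempfile.mkdtemp(), 'bytewrite.csv')

# Write the row exactly as in the corrected example.
with open(path, 'w', encoding='iso-8859-1', newline='') as csvfile:
    w = csv.writer(csvfile)
    w.writerow([b'\x8b'.decode('iso-8859-1')])

# Read the raw bytes back; no decoding happens in 'rb' mode.
with open(path, 'rb') as f:
    raw = f.read()

print(raw)
```

The first byte read back should be 0x8B, followed by the writer's default line terminator.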
