RinkyPinku

Reputation: 410

Unable to print Unicode char despite reading as UTF-8

I am reading in a file as follows:

def main(src):
    with open(src, encoding='UTF-8') as incoming:
        for line in incoming:
            data = line
            print(data)
        del line

The code gets stuck at this line (I don't know if it will show in the browser):

    <DT><A HREF="https://www.youtube.com/watch?v=-ygKS7WU4YU" ADD_DATE="1421587655">?*** EarAbuse ♛ &#39;Pppppp&#39; (Official &amp; Uncensored) - YouTube</A>

Notice that the Black Chess Queen (i.e. \u265b) just after the word EarAbuse seems to be causing the problem, as reported in this traceback:

Traceback (most recent call last):
  File "a.py", line 18, in <module>
    moduleName.main(fileName)
  File "C:\Users\Systems\Desktop\merc\bm\chrome.py", line 53, in main
    print(data)
  File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u265b' in position 87: character maps to <undefined>

I have:

  1. read docs.python.org/3/howto/unicode.html
  2. used data = repr(line)
  3. used errors="surrogateescape" while opening file

No luck. Also, charbase says that the Python escape for that BCQ is u'\u265b'. What does that mean, and am I not implementing it already?

Edit: Strangely, typing print('\u265b') or print('♛') in IDLE works fine without any error and shows that beautiful BCQ. So what is wrong, and why won't my code read beyond this line?

Upvotes: 1

Views: 6243

Answers (2)

Serge Ballesta

Reputation: 148890

The error is (almost) self-explanatory. It says that Python tried to encode the string in the Windows-1252 character set, which cannot represent '\u265b'. It works fine in IDLE because IDLE is a GUI application that is limited only by the glyphs the font can represent, whereas a console application can only display the 256 characters of the console's code page.
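A minimal sketch of that diagnosis, assuming nothing beyond the standard library: the hypothetical helper `can_encode` (not part of the original code) reports whether a string can be represented in a given encoding, which reproduces the failure without needing a Windows console.

```python
def can_encode(text, encoding):
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


queen = '\u265b'  # BLACK CHESS QUEEN, the character from the traceback

print(can_encode(queen, 'cp1252'))  # False: cp1252 has no slot for U+265B
print(can_encode(queen, 'utf-8'))   # True: UTF-8 covers all of Unicode
```

The same check against `sys.stdout.encoding` should tell you in advance whether `print` is likely to fail for a given line.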

You should explicitly convert the string into a bytes object in the correct code page, with errors='replace':

for line in incoming:
    data = line
    print(data.encode('cp1252', errors='replace'))

Of course it will display a ? instead of the ♛, as the console cannot display that Unicode character, but you won't get any error.

If you do not like the b'...' prefix, which indicates that you are printing bytes rather than text, just convert back to a string again:

    print(data.encode('cp1252', errors='replace').decode('cp1252'))

The forward-reverse encoding just ensures that all characters are now printable on the console (or are replaced).
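The round trip can be wrapped in a small helper. This is only a sketch: `console_safe` is a hypothetical name, and falling back to `sys.stdout.encoding` (then ASCII) when no encoding is given is an assumption, not something taken from the question.

```python
import sys


def console_safe(text, encoding=None):
    """Replace characters the target encoding cannot represent with '?'."""
    # Assumption: use the stream's own encoding, or ASCII if it is unknown.
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(encoding, errors='replace').decode(encoding)


# cp1252 is passed explicitly so the result does not depend on the console.
print(console_safe('EarAbuse \u265b', 'cp1252'))  # EarAbuse ?
```

In the question's loop this would be used as `print(console_safe(line))`, so only the unrepresentable characters are lost rather than the whole run.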

Upvotes: 3

ErikR

Reputation: 52039

The problem occurs when you are trying to print the BCQ character. I would guess that your console encoding/locale is not capable of emitting all Unicode code points - i.e. it is either ASCII or a 256-character codepage.

Instead of printing, try:

import sys

sys.stdout.buffer.write(data.encode('utf8'))
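A slightly fuller sketch of the same idea, with a hypothetical helper `write_utf8` that accepts any binary stream so it can be tried without a console. Note that writing UTF-8 bytes avoids the UnicodeEncodeError, but a console whose code page is not UTF-8 will probably still show the queen as garbled characters.

```python
import io
import sys


def write_utf8(text, stream=None):
    """Write text as UTF-8 bytes to a binary stream (default: sys.stdout.buffer)."""
    if stream is None:
        stream = sys.stdout.buffer
    stream.write(text.encode('utf-8'))
    stream.flush()


# Demonstration with an in-memory stream instead of the real console.
buf = io.BytesIO()
write_utf8('EarAbuse \u265b\n', buf)
print(buf.getvalue())  # b'EarAbuse \xe2\x99\x9b\n'
```

Unlike print, this adds no newline of its own; lines read from a file already end with one.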

Upvotes: 2
