Reputation: 410
I am reading in a file as follows:
def main(src):
    with open(src, encoding='UTF-8') as incoming:
        for line in incoming:
            data = line
            print(data)
            del line
The code gets stuck at this line (I don't know if it will show in the browser):
<DT><A HREF="https://www.youtube.com/watch?v=-ygKS7WU4YU" ADD_DATE="1421587655">?*** EarAbuse ♛ 'Pppppp' (Official & Uncensored) - YouTube</A>
Notice the Black Chess Queen (i.e. \u265b) just after the word EarAbuse. It seems to be causing the problem, as reported in this traceback:
Traceback (most recent call last):
  File "a.py", line 18, in <module>
    moduleName.main(fileName)
  File "C:\Users\Systems\Desktop\merc\bm\chrome.py", line 53, in main
    print(data)
  File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u265b' in position 87: character maps to <undefined>
I have tried data = repr(line) and passing errors="surrogateescape" while opening the file. No love. Also, charbase says that the Python escape for that BCQ is u'\u265b'. What does that mean, and am I not implementing it already?
Edit: Strangely, typing print('\u265b') or print('♛') in IDLE works fine without any error and shows that beautiful BCQ. So what is wrong, and why won't my code read beyond this line?
Upvotes: 1
Views: 6243
Reputation: 148890
The error is (almost) self-explanatory. It says that Python tries to encode the string in the Windows-1252 character set, which cannot represent '\u265b'. It works fine in IDLE because IDLE is a GUI application and is limited only by the glyphs that its font can represent, whereas a console application can only display the 256 characters of the console's code page.
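The failure can be reproduced without any file or console involved. The following is a minimal sketch, assuming the console encoding is cp1252 as the traceback indicates; the variable names are only illustrative.

```python
# U+265B BLACK CHESS QUEEN, the character named in the traceback.
queen = '\u265b'

# cp1252 is a single-byte code page with no slot for U+265B,
# so a strict encode raises UnicodeEncodeError.
try:
    queen.encode('cp1252')
    cp1252_error = None
except UnicodeEncodeError as exc:
    cp1252_error = exc

# UTF-8 can represent every Unicode code point, so this succeeds.
utf8_bytes = queen.encode('utf-8')
```

This is the same encode step that print() performs implicitly when it writes to a cp1252 console, which is why reading the file succeeds but printing the line fails.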
You should explicitly convert the string into a bytes object in the correct code page, with errors='replace':
for line in incoming:
    data = line
    print(data.encode('cp1252', errors='replace'))
Of course, it will display a ? instead of the ♛, as the console driver cannot display that Unicode character, but you won't get any error.
If you do not like the b'...' prefix, which indicates that you are printing 8-bit characters (bytes), just convert back to a string again:
print(data.encode('cp1252', errors='replace').decode('cp1252'))
The forward-reverse encoding just ensures that all characters are now printable on the console (or are replaced).
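The round trip can be wrapped in a small helper. This is only a sketch: the function name console_safe and the sample string are made up for illustration, and cp1252 is assumed as the default because it is the encoding shown in the traceback.

```python
def console_safe(text, encoding='cp1252'):
    """Return text with characters the encoding cannot represent replaced by '?'."""
    # Encoding with errors='replace' substitutes '?' for unencodable
    # characters; decoding with the same code page gives back a str.
    return text.encode(encoding, errors='replace').decode(encoding)


sample = "EarAbuse \u265b 'Pppppp'"
safe = console_safe(sample)
# safe is now "EarAbuse ? 'Pppppp'", which a cp1252 console can print.
```

In the loop from the question this would be used as print(console_safe(data)). Passing sys.stdout.encoding instead of the hard-coded default may be preferable when the script has to run on consoles with different code pages.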
Upvotes: 3
Reputation: 52039
The problem occurs when you are trying to print the BCQ character. I would guess that your console encoding/locale is not capable of emitting all Unicode code points - i.e. it is either ASCII or a 256-character codepage.
Instead of printing, try:
import sys
sys.stdout.buffer.write(data.encode('utf8'))
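A small sketch of the same idea, written against any binary stream so that it can be checked without a console. The helper name write_utf8 and the io.BytesIO stand-in are assumptions made for illustration; in the real script the stream would be sys.stdout.buffer.

```python
import io


def write_utf8(stream, text):
    """Write text to a binary stream as UTF-8 bytes, bypassing the console codec."""
    stream.write(text.encode('utf-8'))


# io.BytesIO stands in for sys.stdout.buffer here.
fake_stdout = io.BytesIO()
write_utf8(fake_stdout, 'EarAbuse \u265b\n')
captured = fake_stdout.getvalue()
```

Because the bytes skip the cp1252 encoder, no UnicodeEncodeError is raised. A cp1252 console will probably still show the three UTF-8 bytes of the queen as unrelated characters, but redirecting the output to a file should produce valid UTF-8.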
Upvotes: 2