Reputation: 127
I get the above ERRORs when I try to scrape some text with Finnish names from a 'url'. The solutions I tried, and the corresponding ERRORs, are commented below in the code. I neither know how to fix these nor what the exact issue is. I'm a beginner in Python. Any help appreciated.
My Code:
from lxml import html
import requests

page = requests.get('url')
site = page.text  # ERROR -> 'charmap' codec can't encode character '\x84' in
                  # position {x}: character maps to <undefined>
# site = site.encode('utf-8', errors='replace')  # ERROR -> can't concat str to bytes
# site = site.encode('ascii', errors='replace')  # ERROR -> can't concat str to bytes

with open('url.txt', 'a') as file:
    try:
        file.write(site + '\n')
    except Exception as err:
        file.write('an ERROR occurred: ' + str(err) + '\n')
and the original Exception:
Traceback (most recent call last):
  File "...\parse.py", line 12, in <module>
    file.write(site + '\n')
  File "...\python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x84' in position 12591: character maps to <undefined>
regards, Dominik
Upvotes: 2
Views: 17076
Reputation: 365767
If the exception is happening on `page.text`, as you indicate:
When you ask a `requests` response for its `text`, it uses the encoding that the page claims to be in. If the page is wrong, that will fail, and usually raise a `UnicodeDecodeError`.
For debugging problems like this, you should definitely print out what encoding `requests` got from the server:
print(page.encoding)
A browser will usually just display mojibake. Sometimes, they'll even realize that the encoding is wrong and try to guess at the encoding. They'll rarely fail and refuse to display anything. That makes sense for something designed to display data immediately. It doesn't make sense for many programs designed to process data, or to store data for later (where you want to know there's a problem ASAP, not after you've stored 500GB of useless garbage), etc. That's why `requests` doesn't try too hard to do magic.
If you know the encoding is, say, Latin-6/ISO-8859-10 even though it claims to be something else, you can decode it manually:
site = page.content.decode('iso-8859-10')
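If you only have a couple of plausible candidates rather than certainty, a common pattern is to try UTF-8 first and fall back to the suspected legacy encoding. This is a minimal stdlib-only sketch; `raw` here is sample Finnish text standing in for `page.content` from a mislabelled server:

```python
# Finnish sample text encoded as ISO-8859-10 (Latin-6) bytes,
# standing in for page.content.
raw = "Hyvää päivää".encode("iso-8859-10")

try:
    # Most modern pages really are UTF-8, so try that first.
    site = raw.decode("utf-8")
except UnicodeDecodeError:
    # Fall back to the encoding we suspect the page actually uses.
    site = raw.decode("iso-8859-10")

print(site)  # -> Hyvää päivää
```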
If you don't know, you could use a library like `chardet` or `Unicode, Dammit` to do the same kind of guessing a browser does.
If you want to force it to just decode to something that you can later write back out in the same way, even if it's going to look like garbage in the meantime, you can use the `surrogateescape` error handler:
site = page.content.decode('utf-8', 'surrogateescape')
# ...
with open('url.txt', 'a', encoding='utf-8', errors='surrogateescape') as file:
file.write(site + '\n')
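To see that `surrogateescape` really is lossless, here is a minimal round trip on illustrative bytes (not from the question's actual page):

```python
# Bytes that are not valid UTF-8: 0xe4 starts a multi-byte sequence
# that is never completed.
data = b"nimi: J\xe4rvinen\n"

# Undecodable bytes become lone surrogates (U+DCE4 here) instead of raising.
text = data.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
assert text.encode("utf-8", "surrogateescape") == data
```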
However, if you're not actually doing anything with the contents, it's probably easier to just keep it as bytes:
site = page.content
# ...
with open('url.txt', 'ab') as file:
file.write(site + b'\n')
Notice the `'ab'` instead of `'a'`, and also the `b'\n'`, not `'\n'`. If you're leaving bytes as bytes, or encoding strings to bytes, you can't `write` them to text files, only to binary files, and you can't add them to strings, only to other bytes. Those seem to be some of the problems you ran into with some of your fix attempts.
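Those two mixing errors are easy to reproduce in isolation; this sketch (with made-up sample bytes) shows both failures:

```python
import io

site = b"<html>fake page</html>"  # bytes, as from page.content

# 1. bytes + str raises TypeError ("can't concat str to bytes").
try:
    site + "\n"
    concat_error = None
except TypeError as err:
    concat_error = err

# 2. Writing bytes to a text-mode file also raises TypeError.
text_file = io.StringIO()  # stands in for open('url.txt', 'a')
try:
    text_file.write(site)
    write_error = None
except TypeError as err:
    write_error = err

print(concat_error)
print(write_error)
```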
Upvotes: 2
Reputation: 347
I think it happens because of a Unicode transformation issue.
1. Add the following line to the top of your .py file:
# -*- coding: utf-8 -*-
OR
2. Use the `str.encode('utf8')` function, e.g.: `site = site.encode('utf8')`
Upvotes: 1
Reputation: 2408
Try this instead:
with open('url.txt', 'a', encoding='utf-8') as file:
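As a quick check that this fixes the exact symptom from the traceback, writing a string containing U+0084 (the character cp1252 refuses) succeeds once the file's encoding is UTF-8. A sketch using a temporary file instead of the real 'url.txt':

```python
import os
import tempfile

text = "some scraped text \x84"   # U+0084, the character from the traceback

path = os.path.join(tempfile.mkdtemp(), "url.txt")
with open(path, "a", encoding="utf-8") as file:  # no longer uses the cp1252 default
    file.write(text + "\n")

with open(path, encoding="utf-8") as file:
    round_trip = file.read()

print(round_trip == text + "\n")  # -> True
```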
Upvotes: 7