Oryon

Reputation: 127

Scraping ERROR in Python: 'charmap' codec can't encode character / can't concat str to bytes

I get the above ERRORs when I try to scrape text containing Finnish names from a URL. The solutions I tried, and the corresponding ERRORs, are commented in the code below. I neither know how to fix these, nor what the exact issue is. I'm a beginner in Python. Any help appreciated.

My Code:

from lxml import html
import requests

page = requests.get('url')

site = page.text  # ERROR -> 'charmap' codec can't encode character '\x84' in
                  #          position {x}: character maps to <undefined>
# site = site.encode('utf-8', errors='replace')  # ERROR -> can't concat str to bytes
# site = site.encode('ascii', errors='replace')  # ERROR -> can't concat str to bytes

with open('url.txt', 'a') as file:
    try:
        file.write(site + '\n')
    except Exception as err:
        file.write('an ERROR occurred: ' + str(err) + '\n')

and the original Exception:

Traceback (most recent call last):
  File "...\parse.py", line 12, in <module>
    file.write(site + '\n')
  File "...\python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x84' in position 12591: character maps to <undefined>

regards, Dominik

Upvotes: 2

Views: 17076

Answers (3)

abarnert

Reputation: 365767

If the exception is happening on page.text, as you indicate:

When you ask a requests response for its text, it uses the encoding that the page claims to be in. If the page is wrong, that will fail, and usually raise a UnicodeDecodeError.

For debugging problems like this, you should definitely print out what encoding requests got from the server:

print(page.encoding)

A browser will usually just display mojibake. Sometimes they'll even realize that the encoding is wrong and try to guess at it. They'll rarely fail and refuse to display anything. That makes sense for something designed to display data immediately. It doesn't make sense for many programs designed to process data, or to store data for later (where you want to know there's a problem ASAP, not after you've stored 500GB of useless garbage), etc. That's why requests doesn't try too hard to do magic.
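To make the mojibake point concrete, here's a minimal sketch (my own example, not from the question — the name `Mäkinen` is just a placeholder Finnish surname): decoding bytes with the wrong codec typically doesn't raise, it just produces garbage.

```python
# UTF-8 bytes for a Finnish name
data = 'Mäkinen'.encode('utf-8')

# A program that wrongly assumes cp1252 gets mojibake, but no exception:
print(data.decode('cp1252'))  # prints 'MÃ¤kinen'
```

That silent corruption is exactly what you want to catch early when storing data, rather than display anyway as a browser would.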

If you know the encoding is, say, Latin-6/ISO-8859-10 even though it claims to be something else, you can decode it manually:

site = page.content.decode('iso-8859-10')

If you don't know, you could use a library like chardet, or BeautifulSoup's Unicode, Dammit, to do the same kind of guessing a browser does.

If you want to force it to just decode to something that you can later write back out in the same way, even if it's going to look like garbage in the meantime, you can use the surrogateescape error handler:

site = page.content.decode('utf-8', 'surrogateescape')
# ...
with open('url.txt', 'a', encoding='utf-8', errors='surrogateescape') as file:
    file.write(site + '\n')
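A quick roundtrip check (my own sketch) shows why surrogateescape is safe for this: an undecodable byte survives as a lone surrogate and encodes back to exactly the original byte.

```python
raw = b'good text \x84 bad byte'  # \x84 is not valid UTF-8 on its own

# Decode without losing information...
site = raw.decode('utf-8', 'surrogateescape')

# ...and encode back: the original bytes are reproduced exactly.
print(site.encode('utf-8', 'surrogateescape') == raw)  # prints True
```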

However, if you're not actually doing anything with the contents, it's probably easier to just keep it as bytes:

site = page.content
# ...
with open('url.txt', 'ab') as file:
    file.write(site + b'\n')

Notice that 'ab' instead of 'a', and also that b'\n', not '\n'. If you're leaving bytes as bytes, or encoding strings to bytes, you can't write them to text files, only to binary files, and you can't add them to strings, only to other bytes. Those seem to be some of the problems you ran into with some of your fix attempts.
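Those mixed-type errors are easy to reproduce in isolation (my own sketch, mirroring the question's second error):

```python
site = b'scraped bytes'

try:
    site + '\n'          # bytes + str: not allowed
except TypeError as err:
    print(err)           # e.g. "can't concat str to bytes"

result = site + b'\n'    # bytes + bytes: fine
```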

Upvotes: 2

Prashant Godhani

Reputation: 347

I think this happens because of the Unicode transformation.

1. Add the following line at the top of your .py file:

# -*- coding: utf-8 -*-

OR

2. Use the str.encode('utf8') function

ex: `site = site.encode('utf8')`

Upvotes: 1

Tristo

Reputation: 2408

Try this instead

with open('url.txt', 'a', encoding='utf-8') as file:

Upvotes: 7
