Reputation: 2015
The following code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
uClient = uReq('http://www.google.com')
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html.decode('utf-8', 'ignore'), 'lxml')
print(page_soup.find_all('p'))
...produces the following error:
C:\>python ws1.py
Traceback (most recent call last):
  File "ws1.py", line 10, in <module>
    print(page_soup.find_all('p'))
  File "C:\Python34\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in position 40: character maps to <undefined>
I have searched in vain for a solution. Every post I have read suggests using a specific encoding, but none of them has fixed the problem.
Any help would be appreciated.
Thank you.
Upvotes: 0
Views: 672
Reputation: 308111
You're trying to print a Unicode string that contains characters that can't be represented in the encoding used by your console.
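To confirm that this is what is happening, a quick check along these lines may help. It is only a sketch: `can_encode` is a hypothetical helper name, not part of any library, and the fallback to UTF-8 is an assumption for the case where `sys.stdout.encoding` is not set.

```python
import sys

def can_encode(text, encoding):
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

# sys.stdout.encoding can be None in some environments; UTF-8 is an assumed fallback.
console_encoding = sys.stdout.encoding or "utf-8"
print("console encoding:", console_encoding)

# '\xa9' is the copyright sign named in the traceback.
print("console can encode it:", can_encode("\xa9", console_encoding))
print("cp437 can encode it:", can_encode("\xa9", "cp437"))
```

If the console encoding is reported as cp437, as in the traceback, the copyright sign cannot be encoded and a plain print of that text will fail.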
It appears you're using the Windows command line, so the problem could be solved simply by switching to Python 3.6 or later, which bypasses the console encoding altogether and sends Unicode straight to the Windows console.
If that's not possible, you can encode the string yourself and specify that characters the console encoding cannot represent should be replaced with an escape sequence. Then you must decode it again so that print will work properly.
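import sys
text = str(page_soup.find_all('p'))  # find_all returns a ResultSet, which has no encode(); convert to str first
bstr = text.encode(sys.stdout.encoding, errors='backslashreplace')
print(bstr.decode(sys.stdout.encoding))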
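The encode-then-decode round trip can be tried on its own, without BeautifulSoup or a network request. The following is a minimal sketch: `to_console_safe` is a hypothetical helper name, and cp437 is hard-coded as the default only because it is the encoding shown in the traceback; in a real script you would likely pass `sys.stdout.encoding` instead.

```python
def to_console_safe(text, encoding="cp437"):
    """Replace characters that encoding cannot represent with backslash escapes."""
    raw = text.encode(encoding, errors="backslashreplace")
    # Decoding with the same encoding gives back a str that print can handle.
    return raw.decode(encoding)

sample = "Copyright \xa9 2017"   # contains the character from the traceback
safe = to_console_safe(sample)
print(safe)                      # the copyright sign appears as the literal text \xa9
```

Characters that the target encoding does support pass through unchanged; only the unencodable ones are turned into escape sequences.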
Upvotes: 2