jack-all-trades

Reputation: 282

Stuck with encodings in Python with BeautifulSoup

The page is encoded in UTF-8, and Python's HTMLParser handles it fine with no UnicodeDecodeError, but I get an error when I try to parse it with BeautifulSoup. I've tried adding # -*- coding: utf-8 -*- and calling .encode('utf-8') everywhere, and I'm still getting the error.

import urllib
from BeautifulSoup import BeautifulSoup
args=urllib.urlencode({'keywords':'magic'})
doc=urllib.urlopen('http://www.example.com/submit', args)
soup=BeautifulSoup(doc)
stuff = soup.findAll('section',id='banner')
print stuff

Traceback (most recent call last):
  File "test.py", line 7, in <module>
    print stuff
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 112: ordinal not in range(128)

Upvotes: 2

Views: 1486

Answers (2)

Alastair McCormack

Reputation: 27704

You shouldn't normally get UnicodeEncodeError: 'ascii' ... errors when you print. They usually mean that your locale is broken or set to C, in which case Python can't pick an appropriate encoder for the stdout stream.

Run locale and check for errors or warnings.
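From inside Python you can also look at what the interpreter actually picked. This is only a rough sketch: the helper name describe_output_encoding is made up for illustration, and the values you see will depend on your own environment.

```python
import locale
import sys


def describe_output_encoding():
    """Return the encodings Python reports for stdout and for the locale."""
    return {
        # May be None or 'ascii'/'ANSI_X3.4-1968' when the locale is C or broken.
        "stdout": getattr(sys.stdout, "encoding", None),
        # The encoding the locale settings suggest for text data.
        "preferred": locale.getpreferredencoding(),
    }


info = describe_output_encoding()
print(info)
```

If either value comes back as ASCII or empty, printing a character such as u'\xed' is likely to fail the way the traceback in the question shows.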

If you can't fix your locale, you can often override Python's stdout encoder by setting PYTHONIOENCODING in your environment to an encoding that matches your terminal emulator. Often you can get by with:

export PYTHONIOENCODING=UTF-8

or

PYTHONIOENCODING=UTF-8 python my_script.py
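To check that the variable is being picked up, something like the following might help. It is a sketch, not a definitive test: it starts a child interpreter with PYTHONIOENCODING set and asks it which encoding it is using for stdout. The function name stdout_encoding_with is hypothetical.

```python
import os
import subprocess
import sys


def stdout_encoding_with(ioencoding):
    """Run a child Python with PYTHONIOENCODING set and return its stdout encoding."""
    env = dict(os.environ)
    env["PYTHONIOENCODING"] = ioencoding
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        universal_newlines=True,
    )
    return out.strip()


reported = stdout_encoding_with("UTF-8")
print(reported)
```

The child should report some spelling of UTF-8 regardless of what the locale says, which is the override described above.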

Upvotes: 0

jack-all-trades

Reputation: 282

OK, I found the solution on my last try; maybe it will help others with the same problem. The output needs to be encoded, not decoded:

print( [e.encode('utf-8', 'ignore') for e in stuff] )
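For anyone wondering why encoding helps, here is a small standalone sketch. The sample string is made up; it just contains the same u'\xed' character that the traceback complains about. Encoding to ASCII fails, while encoding to UTF-8 yields bytes that can be written out as-is.

```python
# Hypothetical sample text containing the character from the traceback.
text = u"caf\xed"

try:
    text.encode("ascii")
    failure = None
except UnicodeEncodeError as exc:
    # Same kind of failure as in the question: the character is outside ASCII.
    failure = exc.reason

# UTF-8 can represent the character, so this succeeds.
encoded = text.encode("utf-8")

# With an ASCII target, 'ignore' silently drops what cannot be encoded.
stripped = text.encode("ascii", "ignore")
```

Note that 'ignore' drops nothing when the target is UTF-8, since UTF-8 can encode every character here; it only loses data with narrower encodings such as ASCII.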

Upvotes: 3
