Reputation: 1772

python - how to convert html string to utf-8? Getting UnicodeDecodeError errors

I have a script that's looping through a database and doing some BeautifulSoup processing on each string, along with replacing some text with other text, etc.

This works most of the time, but some HTML blobs seem to contain Unicode text, which breaks the script with the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 112: ordinal not in range(128)

I'm not sure what to do in this case. Does anyone know of a module or function that forces all text in a string to a standardized encoding such as UTF-8?

All the html blobs in the database came from feedparser (downloading rss feeds, storing in db).

Upvotes: 2

Answers (4)

Michael

Reputation: 7726

Before you do any further processing with your string variable:

clean_str = unicode(str_var_with_strange_coding, errors='ignore')

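The messed-up characters are skipped: since no encoding is given, Python 2 decodes with its default codec (ASCII unless you changed it), so every non-ASCII byte is dropped. Not elegant, as you don't try to restore any possibly meaningful values, but effective.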
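To see what errors='ignore' actually throws away, here is a small sketch. It is written for Python 3, where bytes.decode() plays the role of Python 2's unicode(); the sample bytes are made up for illustration and are not from the asker's database.

```python
# Sample UTF-8 bytes (an assumption for illustration):
# "cafe" with an accented e, followed by a curly apostrophe and "s".
raw = b"caf\xc3\xa9\xe2\x80\x99s"

# Decoding as ASCII with errors="ignore" silently drops every non-ASCII byte.
lossy = raw.decode("ascii", errors="ignore")

# Decoding with the encoding the bytes were really written in keeps the characters.
exact = raw.decode("utf-8")

print(repr(lossy))
print(repr(exact))
```

If the blobs really are UTF-8, decoding them as UTF-8 is likely the better first attempt, with 'ignore' kept as a last resort.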

Upvotes: 2

Joe

Reputation: 1772

Well, after a couple more hours of googling, I finally came across a solution that eliminated all decode errors. I'm still fairly new to Python (heavy PHP background) and didn't understand character encoding.

In my code I had a .decode('utf-8') and after that did some .replace(str(beautiful_soup_tag), '') statements. The solution ended up being as simple as changing every str() to unicode(). After that, not a single issue.
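A likely explanation, sketched under assumptions: in Python 2, str() on a tag yields a byte string, and passing it to .replace() on a unicode string makes Python decode those bytes implicitly as ASCII. The snippet below is written for Python 3, which has no implicit decode, so the ASCII step is spelled out by hand; the sample bytes are hypothetical.

```python
# Hypothetical blob: UTF-8 bytes containing a right single quotation mark (U+2019).
raw = b"It\xe2\x80\x99s here"

# Python 2 performed this ASCII decode implicitly when a byte string met a unicode string.
try:
    raw.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# Decode once with the real encoding, then do every replace on text only.
text = raw.decode("utf-8")
cleaned = text.replace("\u2019", "'")

print(message)
print(cleaned)
```

The error text has the same shape as the one in the question, including the 0xe2 byte, which is the first byte of many UTF-8 punctuation characters such as curly quotes.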

Answer found on: http://ubuntuforums.org/showthread.php?t=1212933

I sincerely apologize to the commenters who requested I post the code. What I thought was rock solid and not the issue was quite the opposite, and I'm sure they would have caught the issue right away! I'll not make that mistake again! :)

Upvotes: 1

Jakub M.

Reputation: 33817

Make sure you really understand the difference between Unicode and UTF-8, and that they are not the same thing (which is a surprise to many). See Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets".
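A minimal illustration of that difference, written for Python 3 (where str is Unicode text and bytes is encoded data): the same single character becomes different byte sequences depending on the encoding chosen.

```python
text = "\u00e9"  # one Unicode code point: e with an acute accent

utf8_bytes = text.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = text.encode("latin-1")  # one byte in Latin-1

print(len(text), utf8_bytes, latin1_bytes)

# Decoding with the matching encoding gives back the same text.
same_text = utf8_bytes.decode("utf-8") == latin1_bytes.decode("latin-1")
print(same_text)
```

Unicode is the abstract set of characters; UTF-8 is just one way to write those characters down as bytes.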

What is the encoding of your DB? Is it really UTF-8, or do you only assume that it is? If it contains blobs with random encodings, then you have a problem, because you cannot reliably guess the encoding. When you read from the database, decode the blob to unicode and use unicode from then on in your code.
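One way to test the "is it really UTF-8?" assumption is to try a strict decode of every blob and collect the failures. This is only a sketch in Python 3; the helper name undecodable_rows and the sample data are made up for illustration.

```python
def undecodable_rows(blobs, encoding="utf-8"):
    """Return (index, error message) pairs for blobs that fail a strict decode."""
    problems = []
    for index, blob in enumerate(blobs):
        try:
            blob.decode(encoding)  # strict by default: raises on invalid bytes
        except UnicodeDecodeError as exc:
            problems.append((index, str(exc)))
    return problems


# Hypothetical sample: plain ASCII, valid UTF-8, and a Latin-1 encoded blob.
sample = [b"plain ascii", b"caf\xc3\xa9", b"caf\xe9"]
problems = undecodable_rows(sample)
print(problems)
```

A blob that fails is definitely not valid UTF-8. A blob that passes is only probably UTF-8, since text in some other encoding can happen to form valid UTF-8 byte sequences.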

But let's assume your database is UTF-8. Then you should use unicode everywhere inside your program: decode early, encode late. Only decode or encode when you read from or write to the database, display text, write to a file, etc.

Unicode and encodings are a bit of a pain in Python 2.x; fortunately, in Python 3 all text is Unicode.

Regarding BeautifulSoup, use the latest version, BeautifulSoup 4.

Upvotes: 2

Fredrick Brennan

Reputation: 7357

Since you don't want to show us your code, I'm going to give a general answer that hopefully helps you find the problem.

When you first get the data out of the database and fetch it with fetchone, you need to convert it into a unicode object. It is good practice to do this as soon as you have your variable, and then re-encode it only when you output it.

import MySQLdb
db = MySQLdb.connect()  # pass your own connection parameters here
cur = db.cursor()
cur.execute("SELECT col FROM the_table LIMIT 10")
# Decode as soon as the value is fetched. Use whatever encoding the text is
# really in; we're pretty sure it's utf-8. You might use chardet to check.
xml = cur.fetchone()[0].decode('utf-8')

After you run xml through BeautifulSoup, you might encode the string again if it is being saved into a file or you might just leave it as a Unicode object if you are re-inserting it into the database.
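The whole round trip can be sketched end to end. This version is written for Python 3 and swaps in the standard library's sqlite3 with an in-memory database as a stand-in for MySQLdb, so it runs without a server; the table contents and the replace step are assumptions for illustration, and the BeautifulSoup step is left out.

```python
import sqlite3

# In-memory stand-in for the real database (assumption: blobs are stored as UTF-8 bytes).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE the_table (col BLOB)")
db.execute("INSERT INTO the_table (col) VALUES (?)", (b"<p>It\xe2\x80\x99s here</p>",))

cur = db.cursor()
cur.execute("SELECT col FROM the_table LIMIT 10")
raw = cur.fetchone()[0]           # bytes, because the value was stored as a BLOB
xml = raw.decode("utf-8")         # decode early, right after fetching

xml = xml.replace("\u2019", "'")  # all processing happens on text

encoded = xml.encode("utf-8")     # encode late, only at the output boundary
db.close()

print(xml)
print(encoded)
```

Keeping the decode and encode calls at the two edges means nothing in between ever has to guess whether it is holding bytes or text.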

Upvotes: 1
