Daniel Watkins

Reputation: 1686

Decoding HTML Entities With Python

The following Python code uses urllib2 to fetch the LibraryThing API information for Tolkien's "The Children of Húrin" and BeautifulStoneSoup to parse it.

import urllib2

from BeautifulSoup import BeautifulStoneSoup

URL = ("http://www.librarything.com/services/rest/1.0/"
            "?method=librarything.ck.getwork&id=1907912"
            "&apikey=2a2e596b887f554db2bbbf3b07ff812a")

soup = BeautifulStoneSoup(urllib2.urlopen(URL),
                          convertEntities=BeautifulStoneSoup.ALL_ENTITIES)
title_field = soup.find('field', attrs={'name': 'canonicaltitle'})
print title_field.find('fact').string

Unfortunately, instead of 'Húrin', it prints out 'HÃºrin'. This is obviously an encoding issue, but I can't work out what I need to do to get the expected output. Help would be greatly appreciated.

Upvotes: 0

Views: 7339

Answers (2)

sth

Reputation: 229593

In the source of the web page it looks like this: The Children of H&Atilde;&ordm;rin. So the encoding is already broken somewhere on their side before it even gets converted to XML...

If it's a general issue with all the books and you need to work around it, this seems to work:

unicode(title_field.find('fact').string).encode("latin1").decode("utf-8")
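
As a rough sketch of the same round trip, here it is in Python 3 syntax (the code above is Python 2). The helper name fix_mojibake and the sample string are only illustrative. It assumes the damage is always UTF-8 bytes that were misread as Latin-1, and it leaves the text untouched when that assumption does not hold:

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1.

    Hypothetical helper: returns the input unchanged if the round trip
    fails, which suggests the text was not damaged in this particular way.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# "\u00c3\u00ba" is what the two UTF-8 bytes of "ú" look like
# when they are read as Latin-1.
broken = "The Children of H\u00c3\u00barin"
fixed = fix_mojibake(broken)
print(fixed)
```

The try/except matters if only some titles are affected: a title that is already correct usually fails the UTF-8 decode and is passed through as-is.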

Upvotes: 4

Ignacio Vazquez-Abrams

Reputation: 798636

The web page may be lying about its encoding. The output looks like UTF-8. If you end up with a str, you'll need to decode it as UTF-8. If you have a unicode instead, you'll need to encode it as Latin-1 first and then decode the result as UTF-8.
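
The two cases might look something like this. It is written for Python 3, where the Python 2 str/unicode pair corresponds to bytes/str; the function name to_text is made up for the example, and it assumes the underlying data really is UTF-8:

```python
def to_text(value):
    """Return properly decoded text for either kind of input.

    Hypothetical helper, assuming the real encoding is UTF-8.
    """
    if isinstance(value, bytes):
        # Raw bytes (a Python 2 str): decode them as UTF-8 directly.
        return value.decode("utf-8")
    # Already text (a Python 2 unicode) that was decoded with the wrong
    # codec: go back to the original bytes via Latin-1, then decode
    # those bytes as UTF-8.
    return value.encode("latin-1").decode("utf-8")


from_bytes = to_text(b"H\xc3\xbarin")
from_text = to_text("H\u00c3\u00barin")
print(from_bytes, from_text)
```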

Upvotes: 1
