Reputation: 1686
The following Python code uses BeautifulStoneSoup to fetch the LibraryThing API information for Tolkien's "The Children of Húrin".
import urllib2
from BeautifulSoup import BeautifulStoneSoup
URL = ("http://www.librarything.com/services/rest/1.0/"
       "?method=librarything.ck.getwork&id=1907912"
       "&apikey=2a2e596b887f554db2bbbf3b07ff812a")
soup = BeautifulStoneSoup(urllib2.urlopen(URL),
                          convertEntities=BeautifulStoneSoup.ALL_ENTITIES)
title_field = soup.find('field', attrs={'name': 'canonicaltitle'})
print title_field.find('fact').string
Unfortunately, instead of 'Húrin', it prints out 'HÃºrin'. This is obviously an encoding issue, but I can't work out what I need to do to get the expected output. Help would be greatly appreciated.
Upvotes: 0
Views: 7339
Reputation: 229593
In the source of the web page it looks like this: The Children of HÃºrin. So the encoding is already broken somewhere on their side before it even gets converted to XML...
If it's a general issue with all the books and you need to work around it, this seems to work:
unicode(title_field.find('fact').string).encode("latin1").decode("utf-8")
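If the workaround has to run over many titles, it may help to wrap it so that strings which are not actually broken pass through untouched. This is only a sketch of that idea: the name fix_mojibake is made up for illustration, it assumes the damage is always UTF-8 bytes that were decoded as Latin-1, and it is written to run on Python 3 (where every string is already unicode) rather than the Python 2 used in the question.

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were mistakenly decoded as Latin-1.

    Hypothetical helper, not part of BeautifulSoup or the LibraryThing API.
    """
    try:
        # Latin-1 maps code points 0-255 straight back to the original bytes,
        # which can then be decoded with the encoding they were really in.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the bytes are
        # not valid UTF-8: assume it was never broken and leave it alone.
        return text


# u"H\u00c3\u00barin" is the broken form 'HÃºrin' from the question.
title = fix_mojibake(u"The Children of H\u00c3\u00barin")
```

The try/except is a heuristic, not a guarantee: a correctly encoded string whose Latin-1 bytes happen to form valid UTF-8 would still be altered, so it is safest to apply this only to fields known to be affected.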
Upvotes: 4
Reputation: 798636
The web page may be lying about its encoding. The output looks like UTF-8 that has been decoded as Latin-1. If you end up with a str, you'll need to decode it as UTF-8. If you have a unicode instead, you'll need to encode it as Latin-1 first and then decode the result as UTF-8.
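The two cases could be sketched roughly like this. The function name to_text is hypothetical, and the sketch assumes the underlying bytes really are UTF-8; `bytes` is an alias for `str` on Python 2.6+, so the check should behave the same way there.

```python
def to_text(value):
    """Return correctly decoded text from raw bytes or from mis-decoded text.

    Hypothetical helper; assumes the underlying bytes are UTF-8.
    """
    if isinstance(value, bytes):
        # Byte string (a Python 2 str): just decode it as UTF-8.
        return value.decode("utf-8")
    # Text that was wrongly decoded as Latin-1: recover the original
    # bytes first, then decode them as UTF-8.
    return value.encode("latin-1").decode("utf-8")


raw = u"H\u00farin".encode("utf-8")   # the bytes the server actually sent
broken = raw.decode("latin-1")        # what a wrong Latin-1 decode produces
assert to_text(raw) == to_text(broken) == u"H\u00farin"
```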
Upvotes: 1