Blainer

Reputation: 2702

Python BeautifulSoup ASCII error

My script works when I download an English Bible, but gives me an ASCII error when I download a foreign-language Bible.


from BeautifulSoup import BeautifulSoup, Tag, NavigableString
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re
print ("downloading and converting Bibles to Aurora...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
  url = link.get('href')
  name = urlparse.urlparse(url).path.split('/')[-1]
  namesave = '%s.html' % '.'.join(name.split('.')[:-1])
  chnum = name.split('.')[-2]
  dirname = urlparse.urlparse(url).path.split('.')[-1]
  try:
      f = urllib2.urlopen(url)
  except urllib2.URLError:
      print "Bad URL or timeout"
      continue
  s = f.read()
  if (os.path.isdir(dirname) == 0): 
    os.mkdir(dirname)
  soup = BeautifulSoup(s)
  thearticle = soup.html.body.article
  bookname = thearticle['data-book-human']
  soup.html.replaceWith('<html>'+str(bookname)+'</html>')
  converted = str(soup)
  full_path = os.path.join(dirname, namesave)
  open(full_path, 'wb').write(converted)
  print(name)
print("DOWNLOADS AND CONVERSIONS COMPLETE!")

links.html that works

<a href="http://www.youversion.com/bible/john.6.ceb">http://www.youversion.com/bible/john.6.ceb</a>

links.html that gives error

<a href="http://www.youversion.com/bible/john.6.nav">http://www.youversion.com/bible/john.6.nav</a>

the error

  File "test.py", line 32, in <module>
    soup.html.replaceWith('<html>'+str(bookname)+'</html>')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)

Upvotes: 1

Views: 473

Answers (1)

Jonas Geiregat

Reputation: 5442

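I've seen a similar error before, possibly the same one, though I can't recall exactly. The likely cause is str(bookname): in Python 2, calling str() on a unicode value encodes it with the ASCII codec, which fails on non-ASCII characters.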

Try:

BeautifulSoup(s, convertEntities=BeautifulSoup.HTML_ENTITIES)

Or try to force unicode:

soup.html.replaceWith(u'<html>'+unicode(bookname)+u'</html>')
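
To see the failure in isolation, here is a minimal sketch. It is written so that it also runs on Python 3, where the ASCII encoding that Python 2's str() performs implicitly has to be spelled out with .encode('ascii'). The bookname value and the wrap_ascii / wrap_utf8 helpers are hypothetical; in the real script the name comes from the page's data-book-human attribute.

```python
# -*- coding: utf-8 -*-

# Hypothetical book name with five non-ASCII characters (an assumption,
# standing in for whatever data-book-human holds on the failing page).
bookname = u'\u064a\u0648\u062d\u0646\u0627'


def wrap_ascii(name):
    # Mimics Python 2's str(name): encode the text with the ASCII codec.
    return b'<html>' + name.encode('ascii') + b'</html>'


def wrap_utf8(name):
    # Build the markup as text and encode explicitly, once, as UTF-8.
    return (u'<html>' + name + u'</html>').encode('utf-8')


try:
    wrap_ascii(bookname)
    failed = False
except UnicodeEncodeError as exc:
    # Same exception type as in the traceback from the question.
    failed = True
    print(exc)

# The UTF-8 version succeeds and yields bytes that can be written
# to a file opened in 'wb' mode.
converted = wrap_utf8(bookname)
```

An English name such as u'John' passes through wrap_ascii without error, which matches the script working for the English page and failing for the other one.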

Upvotes: 2
