Reputation: 2702
My script works when I download an English Bible, but gives me an ASCII error when I download a foreign-language Bible.
python
from BeautifulSoup import BeautifulSoup, Tag, NavigableString
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re
print ("downloading and converting Bibles to Aurora...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
    name = urlparse.urlparse(url).path.split('/')[-1]
    namesave = '%s.html' % '.'.join(name.split('.')[:-1])
    chnum = name.split('.')[-2]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    try:
        f = urllib2.urlopen(url)
    except urllib2.URLError:
        print "Bad URL or timeout"
        continue
    s = f.read()
    if (os.path.isdir(dirname) == 0):
        os.mkdir(dirname)
    soup = BeautifulSoup(s)
    thearticle = soup.html.body.article
    bookname = thearticle['data-book-human']
    soup.html.replaceWith('<html>'+str(bookname)+'</html>')
    converted = str(soup)
    full_path = os.path.join(dirname, namesave)
    open(full_path, 'wb').write(converted)
    print(name)
print("DOWNLOADS AND CONVERSIONS COMPLETE!")
links.html that works:
<a href="http://www.youversion.com/bible/john.6.ceb">http://www.youversion.com/bible/john.6.ceb</a>
links.html that gives the error:
<a href="http://www.youversion.com/bible/john.6.nav">http://www.youversion.com/bible/john.6.nav</a>
The error:
  File "test.py", line 32, in <module>
    soup.html.replaceWith('<html>'+str(bookname)+'</html>')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
Upvotes: 1
Views: 473
Reputation: 5442
I've seen a similar error before, possibly the same one, though I can't recall exactly.
Try:
BeautifulSoup(s, convertEntities=BeautifulSoup.HTML_ENTITIES)
Or try forcing Unicode, so that str() doesn't implicitly encode the book name with the ASCII codec:
soup.html.replaceWith(u'<html>'+unicode(bookname)+u'</html>')
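To illustrate why forcing Unicode should help, here is a minimal sketch that doesn't need BeautifulSoup or a network connection. The `bookname` value is a hypothetical stand-in for `thearticle['data-book-human']` (I haven't checked what the real page returns), and the output file name is made up. It is written to run on both Python 2 and Python 3.

```python
import io
import os
import tempfile

# Hypothetical stand-in for thearticle['data-book-human'];
# any non-ASCII text behaves the same way.
bookname = u'\u064a\u0648\u062d\u0646\u0627'

# On Python 2, str(bookname) implicitly encodes with the ASCII codec.
# Doing the same thing explicitly reproduces the failure.
try:
    bookname.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Keep everything as Unicode while building the markup...
markup = u'<html>' + bookname + u'</html>'

# ...and choose an encoding explicitly only when writing to disk.
# The directory and file name below are assumptions for this demo.
out_dir = tempfile.mkdtemp()
full_path = os.path.join(out_dir, 'john.6.html')
with io.open(full_path, 'w', encoding='utf-8') as out:
    out.write(markup)

# Read it back to confirm the text survived the round trip.
with io.open(full_path, 'r', encoding='utf-8') as src:
    round_trip = src.read()
```

In the original script the same idea would probably also apply to `converted = str(soup)`: keep the text as Unicode and encode it (UTF-8 is a reasonable choice) only at the point where the file is written.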
Upvotes: 2