Reputation: 93
I've parsed a web page that shows an article. I want to save the parsed text to a file, but my Python shell raises this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 107: ordinal not in range(128)
Here is part of my code:
search_result = urllib.urlopen(url)
f = search_result.read()
#xml parsing
parsedResult = xml.dom.minidom.parseString(f)
linklist = parsedResult.getElementsByTagName('link') #extracting links
extractedURL = linklist[3].firstChild.nodeValue #pick one link
page = urllib.urlopen(extractedURL).read()
# write the page to an HTML file
g = open('yyyy.html', 'w')
g.write(page)
g.close()
# read the HTML file back and parse it to get the plain text of the article
g = open('yyyy.html', 'r')
bs = BeautifulSoup(g, fromEncoding="utf-8")
g.close()
article = bs.find(id="articleBody")
content = article.get_text()
# save as a text file
h = open('yyyy.txt', 'w')
h.write(content)
h.close()
What should I add to make this work?
Upvotes: 0
Views: 140
Reputation: 15930
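In Python 2, writing a unicode string to a file opened with plain open() implicitly encodes it as ASCII, which fails on characters such as u'\u2019'. Open the output file with an explicit encoding instead: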
import codecs
h = codecs.open('yyyy.txt', 'w', 'utf-8')
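Alternatively, use Python 3, where the built-in open() accepts an encoding argument, e.g. open('yyyy.txt', 'w', encoding='utf-8').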
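For reference, here is a minimal sketch of the same idea using io.open, which takes an encoding argument on both Python 2 and Python 3. The save_text helper, the sample string, and the temporary output path are hypothetical stand-ins for illustration; in the question's code, the text to write would be the content returned by article.get_text().

```python
import io
import os
import tempfile


def save_text(path, text):
    """Write a Unicode string to path, encoding it as UTF-8 on the way out."""
    # io.open accepts an explicit encoding, so no implicit ASCII encode happens.
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Hypothetical sample containing U+2019, the character named in the error message.
content = u"It\u2019s the article body"

# Hypothetical output location; the question writes to 'yyyy.txt' instead.
out_path = os.path.join(tempfile.gettempdir(), "yyyy.txt")
save_text(out_path, content)

# Read the file back with the same encoding to confirm the text round-trips.
with io.open(out_path, "r", encoding="utf-8") as handle:
    round_trip = handle.read()
```

Using a with block also closes the file even if the write fails, so the explicit h.close() is no longer needed.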
Upvotes: 1
Reputation: 902
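Try unidecode, which transliterates non-ASCII characters to their closest ASCII equivalents. Note that this is lossy: for example, the curly apostrophe u'\u2019' becomes a plain '.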
from unidecode import unidecode
content = unidecode(content)  # pass the Unicode text from get_text(), and keep the result
Upvotes: 0