Reputation: 738
I've searched the site and haven't found an answer that works for me. I'm trying to write XML to a file, and when I run the script from the terminal I get:
Traceback (most recent call last):
  File "fetchWiki.py", line 145, in <module>
    pageDictionary = qSQL(users_database)
  File "fetchWiki.py", line 107, in qSQL
    writeXML(listNS)
  File "fetchWiki.py", line 139, in writeXML
    f1.write(doc.toprettyxml(indent="\t", encoding="utf-8"))
  File "/usr/lib/python2.7/xml/dom/minidom.py", line 57, in toprettyxml
    self.writexml(writer, "", indent, newl, encoding)
  File "/usr/lib/python2.7/xml/dom/minidom.py", line 1751, in writexml
    node.writexml(writer, indent, addindent, newl)
----//---- more lines in here ----//----
    self.childNodes[0].writexml(writer, '', '', '')
  File "/usr/lib/python2.7/xml/dom/minidom.py", line 1040, in writexml
    _write_data(writer, "%s%s%s" % (indent, self.data, newl))
  File "/usr/lib/python2.7/xml/dom/minidom.py", line 297, in _write_data
    writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1176: ordinal not in range(128)
This is from the following code:
doc = Document()
base = doc.createElement('Wiki')
doc.appendChild(base)
for ns_dict in listNamespaces:
    namespace = doc.createElement('Namespace')
    base.appendChild(namespace)
    namespace.setAttribute('NS', ns_dict)
    for title in listNamespaces[ns_dict]:
        page = doc.createElement('Page')
        try:
            title.encode('utf8')
            page.setAttribute('Title', title)
        except:
            newTitle = title.decode('latin1', 'ignore')
            newTitle.encode('utf8', 'ignore')
            page.setAttribute('Title', newTitle)
        namespace.appendChild(page)
        text = doc.createElement('Content')
        text_content = doc.createTextNode(listNamespaces[ns_dict][title])
        text.appendChild(text_content)
        page.appendChild(text)
f1 = open('pageText.xml', 'w')
f1.write(doc.toprettyxml(indent="\t", encoding="utf-8"))
f1.close()
The error occurs with or without the 'ignore' argument to encode / decode. Adding
# -*- coding: utf-8 -*-
does not help.
I created the Python script in Eclipse with Pydoc, where it runs without problems, but it fails with the error above when I run it from the terminal.
Any help is much appreciated including links to answers I did not find.
Thanks.
Upvotes: 2
Views: 2533
Reputation: 1124538
You should not encode the strings you use for attributes. The minidom library handles those for you when writing.
Your error is caused by mixing bytestrings with unicode data, and your encoded bytestrings are not decodable as ASCII.
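To make that concrete: when Python 2 combines a bytestring with unicode data, it first decodes the bytestring as ASCII. The sketch below performs that decode explicitly, on a made-up sample value (not taken from your data), so the failure is visible in isolation:

```python
# Hypothetical sample: 0xc2 0xa0 is the UTF-8 encoding of U+00A0 (no-break space).
raw = b'Page\xc2\xa0title'

try:
    # This is the decode Python 2 performs implicitly when bytes meet unicode.
    raw.decode('ascii')
    failed = False
    message = ''
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)

# Decoding with the encoding the bytes were actually written in succeeds.
text = raw.decode('utf-8')
```

The message names the ascii codec and byte 0xc2, the same shape as the error in your traceback.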
If some of your data is encoded, and some of it is in unicode, try to avoid that situation in the first place. If you cannot avoid having to handle mixed data, do this instead:
page = doc.createElement('Page')
if not isinstance(title, unicode):
    title = title.decode('latin1', 'ignore')
page.setAttribute('Title', title)
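If the same check is needed for titles, namespace names and page text alike, it could be pulled into a small helper. This is only a sketch: to_text is a hypothetical name, and trying UTF-8 before falling back to Latin-1 is my assumption about your data, not something the traceback proves. It tests against bytes (an alias for str on Python 2) so the same code also runs on Python 3:

```python
def to_text(value, fallback='latin1'):
    """Return value as a unicode string (hypothetical helper)."""
    if not isinstance(value, bytes):
        return value  # already unicode text, leave untouched
    try:
        # Assumption: most bytestrings are UTF-8.
        return value.decode('utf-8')
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return value.decode(fallback)
```

With that in place, the attribute call becomes page.setAttribute('Title', to_text(title)), and the same helper can wrap the value passed to doc.createTextNode().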
Note that you don't need to use doc.toprettyxml(); you can instruct doc.writexml() to indent your XML for you as well:
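import codecs
with codecs.open('pageText.xml', 'w', encoding='utf8') as f1:
    # addindent is the per-level indentation (indent only sets the starting offset);
    # encoding here only adds encoding="utf-8" to the XML declaration.
    doc.writexml(f1, addindent='\t', newl='\n', encoding='utf-8')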
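Putting the pieces together, here is a minimal round trip under stated assumptions: the pages list is made-up sample data, the output goes to a temporary directory rather than your working directory, and addindent is passed because that is the writexml() argument controlling per-level indentation. It is a sketch of the approach, not a drop-in replacement for your writeXML function:

```python
import codecs
import os
import tempfile
from xml.dom.minidom import Document, parse

# Hypothetical sample data: one unicode title and one Latin-1 bytestring title.
pages = [(u'Caf\xe9', u'first page'), (b'Ni\xf1o', u'second page')]

doc = Document()
base = doc.createElement('Wiki')
doc.appendChild(base)
for title, content in pages:
    if isinstance(title, bytes):
        # Decode bytestrings up front; minidom should only ever see unicode.
        title = title.decode('latin1')
    page = doc.createElement('Page')
    page.setAttribute('Title', title)
    page.appendChild(doc.createTextNode(content))
    base.appendChild(page)

path = os.path.join(tempfile.mkdtemp(), 'pageText.xml')
with codecs.open(path, 'w', encoding='utf8') as f1:
    # The codecs writer does the encoding; writexml() is handed unicode only.
    doc.writexml(f1, addindent='\t', newl='\n', encoding='utf-8')

# Parse the file back to confirm the non-ASCII titles survived.
titles = sorted(p.getAttribute('Title')
                for p in parse(path).getElementsByTagName('Page'))
```

The key point is that all decoding happens before the data reaches minidom, and all encoding happens in the file object, so nothing is left for Python to decode implicitly as ASCII.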
Upvotes: 7