Reputation: 25
I was running the following code in Python:
import xml.etree.ElementTree as ET
tree = ET.parse('dplp_11.xml')
root = tree.getroot()
f = open('workfile', 'w')
for country in root.findall('article'):
    rank = country.find('year').text
    name = country.find('title').text
    if(int(rank)>2009):
        f.write(name)
        auth = country.findall('author')
        for a in auth:
            #print str(a)
            f.write(a.text)
            f.write(',')
        f.write('\n')
I got an error:
Traceback (most recent call last):
  File "parser.py", line 14, in <module>
    f.write(a.text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 4: ordinal not in range(128)
I was trying to parse the dblp data which looks like this:
<?xml version="1.0"?>
<dblp>
<article mdate="2011-01-11" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
<number>7</number>
<url>db/journals/acta/acta33.html#Saxena96</url>
<ee>http://dx.doi.org/10.1007/BF03036466</ee>
</article>
<article mdate="2015-07-14" key="journals/acta/BozapalidisFR12">
<author>Symeon Bozapalidis</author>
<author>Zoltán Fülöp 0001</author>
<author>George Rahonis</author>
<title>Equational weighted tree transformations.</title>
<pages>29-52</pages>
<year>2012</year>
<volume>49</volume>
<journal>Acta Inf.</journal>
<number>1</number>
<ee>http://dx.doi.org/10.1007/s00236-011-0148-5</ee>
<url>db/journals/acta/acta49.html#BozapalidisFR12</url>
</article>
</dblp>
How can I resolve it?
Upvotes: 1
Views: 1560
Reputation: 1124448
a.text is a Unicode object, but you are trying to write it to a plain Python 2 file object:
f.write(a.text)
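The f.write() method of such a file only takes a byte string (type str). Passing it a Unicode object triggers an implicit encode with the ASCII codec, which raises your exception whenever the text contains a character that ASCII can't represent, such as the á (u'\xe1') in your data.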
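The failure can be reproduced without any file. This is only a minimal sketch: it takes the author name from the sample data and performs explicitly the ASCII encode that Python 2 attempts implicitly inside f.write(). The variable names are illustrative.

```python
# The second author from the sample record, as a Unicode string.
name = u'Zolt\xe1n F\xfcl\xf6p 0001'

try:
    # The same encode that Python 2 applies implicitly in f.write(a.text).
    name.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    error = exc

# A codec that covers the data, such as UTF-8, encodes the same string fine.
encoded = name.encode('utf8')
```

The caught exception reports the offending character at index 4, which is the á in the name, matching the position given in the traceback.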
You'll either need to encode it explicitly with a codec that can represent your data, or use an io.open() file object that does the encoding for you.
Encoding explicitly to UTF-8 would work, for example:
f.write(a.text.encode('utf8'))
or use io.open() with an explicit encoding:
import io
# ...
f = io.open('workfile', 'w', encoding='utf8')
after which everything you pass to f.write() must be a Unicode object; prefix any literal strings with u:
for a in auth:
    f.write(a.text)
    f.write(u',')
f.write(u'\n')
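Putting the io.open() variant together, a self-contained sketch might look like the following. It makes a few assumptions that are not in the question: the XML is a trimmed copy of the sample record held in memory rather than read from the file, and write_recent() and to_text() are hypothetical helper names. to_text() is there because ElementTree in Python 2 can return a plain byte string for ASCII-only text, which an io.open() file would reject.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Trimmed copy of the sample record as UTF-8 bytes (stands in for the real file).
XML_DATA = u"""<?xml version="1.0"?>
<dblp>
<article>
<author>Symeon Bozapalidis</author>
<author>Zolt\xe1n F\xfcl\xf6p 0001</author>
<title>Equational weighted tree transformations.</title>
<year>2012</year>
</article>
</dblp>
""".encode('utf8')


def to_text(value):
    # Hypothetical helper: decode ASCII-only byte strings (as Python 2's
    # ElementTree may return) so that only Unicode reaches f.write().
    if isinstance(value, bytes):
        return value.decode('ascii')
    return value


def write_recent(xml_bytes, path, min_year=2009):
    # Write the title and authors of every article newer than min_year.
    root = ET.fromstring(xml_bytes)
    with io.open(path, 'w', encoding='utf8') as f:
        for article in root.findall('article'):
            if int(article.find('year').text) > min_year:
                f.write(to_text(article.find('title').text))
                for author in article.findall('author'):
                    f.write(to_text(author.text))
                    f.write(u',')
                f.write(u'\n')


out_path = os.path.join(tempfile.mkdtemp(), 'workfile')
write_recent(XML_DATA, out_path)

# Read the file back with the same encoding to check what was written.
with io.open(out_path, encoding='utf8') as f:
    result = f.read()
```

Using a with block also closes the file, which the original loop never does.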
Upvotes: 1