Reputation: 43
I am trying to replicate the example from this tutorial, but using iterparse with elem.clear().
XML example:
<?xml version="1.0" encoding="UTF-8"?>
<scenario>
  <world>
    <region name="USA">
      <AgSupplySector name="Corn" nocreate="1">
        <AgSupplySubsector name="Corn_NelsonR" nocreate="1">
          <AgProductionTechnology name="Corn_NelsonR" nocreate="1">
            <period year="1975">
              <Non-CO2 name="SO2_1_AWB">
                <input-emissions>3.98749e-05</input-emissions>
                <output-driver/>
                <gdp-control name="GDP_control">
                  <max-reduction>60</max-reduction>
                  <steepness>3.5</steepness>
                </gdp-control>
              </Non-CO2>
              <Non-CO2 name="NOx_AWB">
                <input-emissions>0.000285263</input-emissions>
                <output-driver/>
                <gdp-control name="GDP_control">
                  <max-reduction>60</max-reduction>
                  <steepness>3.5</steepness>
                </gdp-control>
              </Non-CO2>
            </period>
          </AgProductionTechnology>
        </AgSupplySubsector>
      </AgSupplySector>
    </region>
  </world>
</scenario>
The expected output is a CSV file with the columns Year, Gas, Value, Technology, Crop and Country. I am trying to parse the XML using the following code:
import os
import xml.etree.cElementTree as etree
import codecs
import csv
PATH = 'D:\Book1'
FILENAME_BIO = 'Test.csv'
FILENAME_XML = 'all_aglu_emissions.xml'
ENCODING = "utf-8"
pathBIO = os.path.join(PATH, FILENAME_BIO)
pathXML = os.path.join(PATH, FILENAME_XML)
with codecs.open(pathBIO, "w", ENCODING) as bioFH:
    bioWriter = csv.writer(bioFH, quoting=csv.QUOTE_MINIMAL)
    bioWriter.writerow(['Year','Gas', 'Value','Technology','Crop','Country'])
    for event, elem in etree.iterparse(pathXML, events=('start','end')):
        if event == 'start' and elem.tag == 'region':
            str1 = elem.attrib['name']
        elif event == 'start' and elem.tag == 'AgSupplySector':
            str2 = elem.attrib['name']
        elif event == 'start' and elem.tag == 'AgProductionTechnology':
            str3 = elem.attrib['name']
        elif event == 'start' and elem.tag == 'period':
            str4 = elem.attrib['year']
        elif event == 'start' and elem.tag == 'Non-CO2':
            str5 = elem.attrib['name']
        elif event == 'end' and elem.tag == 'input-emissions':
            for em in elem.iter('input-emissions'):
                str6 = em.text
                bioWriter.writerow([str4, str5, str6, str3, str2, str1])
        elem.clear()
My issue is that I get extra lines in which the str6 field is empty. I probably have a nesting problem here. Please help.
Upvotes: 0
Views: 339
Reputation: 338108
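The for em in elem.iter('input-emissions') loop is useless, drop it. At that point elem already is the input-emissions element, so the loop only ever visits elem itself.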
import os
import xml.etree.ElementTree as etree
import csv
PATH = '.'
FILENAME_BIO = 'Test.csv'
FILENAME_XML = 'all_aglu_emissions.xml'
pathBIO = os.path.join(PATH, FILENAME_BIO)
pathXML = os.path.join(PATH, FILENAME_XML)
with open(pathBIO, 'w', encoding='utf8', newline='') as bioFH:
    bioWriter = csv.writer(bioFH, quoting=csv.QUOTE_MINIMAL)
    bioWriter.writerow('Year Gas Value Technology Crop Country'.split())
    for event, elem in etree.iterparse(pathXML, events=('start',)):
        if elem.tag == 'region':
            str1 = elem.attrib['name']
        elif elem.tag == 'AgSupplySector':
            str2 = elem.attrib['name']
        elif elem.tag == 'AgProductionTechnology':
            str3 = elem.attrib['name']
        elif elem.tag == 'period':
            str4 = elem.attrib['year']
        elif elem.tag == 'Non-CO2':
            str5 = elem.attrib['name']
        elif elem.tag == 'input-emissions':
            str6 = elem.text
            bioWriter.writerow([str4, str5, str6, str3, str2, str1])
        elem.clear()
I made a few other subtle changes to the code, since I assume you're using Python 3 for this. They include using xml.etree.ElementTree instead of the deprecated xml.etree.cElementTree (removed in Python 3.9), skipping the codecs module (the built-in open() accepts an encoding argument in Python 3) and passing the newline='' parameter to the open() call, so the csv module can handle newlines correctly by itself.
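To illustrate why newline='' matters, here is a minimal sketch. It uses an in-memory io.StringIO buffer as a stand-in for the output file (that substitution is my assumption, purely for demonstration). The csv writer terminates each row with '\r\n' on its own, so a file opened in text mode without newline='' on Windows would translate the '\n' a second time, which typically shows up as blank lines between rows.

```python
import csv
import io

# StringIO(newline='') behaves like open(..., newline=''):
# nothing is translated on write, so we see exactly what csv emits.
buf = io.StringIO(newline='')
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
writer.writerow(['Year', 'Gas'])
writer.writerow(['1975', 'SO2_1_AWB'])

raw = buf.getvalue()
print(repr(raw))  # each row already ends with '\r\n'
```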
Since listening to the start event is enough for the desired effect, I've dropped handling the end event entirely.
The result is:
Year,Gas,Value,Technology,Crop,Country
1975,SO2_1_AWB,3.98749e-05,Corn_NelsonR,Corn,USA
1975,NOx_AWB,0.000285263,Corn_NelsonR,Corn,USA
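One caveat for large files: the Python documentation for iterparse() notes that on a 'start' event only the attributes are guaranteed to be present, while text and children may not have been read yet. A more defensive variant, sketched below under that assumption, reads attributes on 'start', reads the text on 'end', and clears only elements that are finished. The names CONTEXT_ATTRS, iter_emission_rows and SAMPLE are hypothetical helpers introduced for this illustration, and the sample is a trimmed copy of the XML from the question.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical lookup: tag -> attribute that carries its label.
CONTEXT_ATTRS = {
    'region': 'name',
    'AgSupplySector': 'name',
    'AgProductionTechnology': 'name',
    'period': 'year',
    'Non-CO2': 'name',
}

def iter_emission_rows(source):
    """Yield [year, gas, value, technology, crop, country] lists."""
    context = {}
    for event, elem in etree.iterparse(source, events=('start', 'end')):
        if event == 'start':
            # Attributes are already defined when 'start' fires.
            if elem.tag in CONTEXT_ATTRS:
                context[elem.tag] = elem.get(CONTEXT_ATTRS[elem.tag])
        else:
            # On 'end' the element is fully populated, so text is safe to read.
            if elem.tag == 'input-emissions':
                yield [context['period'], context['Non-CO2'], elem.text,
                       context['AgProductionTechnology'],
                       context['AgSupplySector'], context['region']]
            # Free memory only for elements that are complete.
            elem.clear()

# Trimmed sample based on the XML in the question.
SAMPLE = b"""<scenario><world><region name="USA">
<AgSupplySector name="Corn"><AgSupplySubsector name="Corn_NelsonR">
<AgProductionTechnology name="Corn_NelsonR"><period year="1975">
<Non-CO2 name="SO2_1_AWB"><input-emissions>3.98749e-05</input-emissions></Non-CO2>
<Non-CO2 name="NOx_AWB"><input-emissions>0.000285263</input-emissions></Non-CO2>
</period></AgProductionTechnology></AgSupplySubsector></AgSupplySector>
</region></world></scenario>"""

rows = list(iter_emission_rows(io.BytesIO(SAMPLE)))
for row in rows:
    print(row)
```

Each yielded list can be passed straight to bioWriter.writerow(). Cleared elements still remain as empty children of their parent until the parent itself ends, so memory use is bounded per region rather than per element.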
Upvotes: 2