Bex

Reputation: 43

How to apply ElementTree iterparse to a nested XML file

I am trying to replicate the example from this tutorial, but using iterparse with elem.clear().

XML example:

<?xml version="1.0" encoding="UTF-8"?>
<scenario>
    <world>
        <region name="USA">
            <AgSupplySector name="Corn" nocreate="1">
                <AgSupplySubsector name="Corn_NelsonR" nocreate="1">
                    <AgProductionTechnology name="Corn_NelsonR" nocreate="1">
                        <period year="1975">
                            <Non-CO2 name="SO2_1_AWB">
                                <input-emissions>3.98749e-05</input-emissions>
                                <output-driver/>
                                <gdp-control name="GDP_control">
                                    <max-reduction>60</max-reduction>
                                    <steepness>3.5</steepness>
                                </gdp-control>
                            </Non-CO2>
                            <Non-CO2 name="NOx_AWB">
                                <input-emissions>0.000285263</input-emissions>
                                <output-driver/>
                                <gdp-control name="GDP_control">
                                    <max-reduction>60</max-reduction>
                                    <steepness>3.5</steepness>
                                </gdp-control>
                            </Non-CO2>
                        </period>
                    </AgProductionTechnology>
                </AgSupplySubsector>
            </AgSupplySector>
        </region>
    </world>
</scenario>

The expected output is a table with the columns Year, Gas, Value, Technology, Crop, and Country. I am trying to parse the file using the following code:

import os
import xml.etree.cElementTree as etree
import codecs
import csv

PATH = r'D:\Book1'
FILENAME_BIO = 'Test.csv'
FILENAME_XML = 'all_aglu_emissions.xml'
ENCODING = "utf-8"


pathBIO = os.path.join(PATH, FILENAME_BIO)
pathXML = os.path.join(PATH, FILENAME_XML)

with codecs.open(pathBIO, "w", ENCODING) as bioFH:
    bioWriter = csv.writer(bioFH, quoting=csv.QUOTE_MINIMAL)
    bioWriter.writerow(['Year','Gas', 'Value','Technology','Crop','Country'])

    for event, elem in etree.iterparse(pathXML, events=('start','end')):
        if event == 'start' and elem.tag == 'region':
            str1 = elem.attrib['name']
        elif event == 'start' and elem.tag == 'AgSupplySector':
            str2 = elem.attrib['name']
        elif event == 'start' and elem.tag == 'AgProductionTechnology':
            str3 = elem.attrib['name']
        elif event == 'start' and elem.tag == 'period':
            str4 = elem.attrib['year']
        elif event == 'start' and elem.tag == 'Non-CO2':
            str5 = elem.attrib['name']
        elif event == 'end' and elem.tag == 'input-emissions':
            for em in elem.iter('input-emissions'):
                str6 = em.text
                bioWriter.writerow([str4, str5, str6, str3, str2, str1])
            
            elem.clear()

My issue is that I get extra rows with empty fields for str6. I probably have a nesting problem here. Please help. In the erroneous output, 0 fields appear.

Upvotes: 0

Views: 339

Answers (1)

Tomalak

Reputation: 338108

The for em in elem.iter('input-emissions') loop is redundant, so drop it.
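To illustrate why the loop adds nothing: Element.iter() walks the element itself before its descendants, and an input-emissions element in the sample has no nested input-emissions children, so the loop body runs exactly once on elem. A minimal sketch, using a single element parsed from a string rather than the real file:

```python
import xml.etree.ElementTree as ET

# A single input-emissions element, as it appears in the sample XML.
elem = ET.fromstring("<input-emissions>3.98749e-05</input-emissions>")

# iter() yields the element itself first, then any matching descendants.
matches = list(elem.iter("input-emissions"))

print(len(matches))        # 1
print(matches[0] is elem)  # True
print(matches[0].text)     # 3.98749e-05
```

The revised script below reads elem.text directly instead.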

import os
import xml.etree.ElementTree as etree
import csv

PATH = '.'
FILENAME_BIO = 'Test.csv'
FILENAME_XML = 'all_aglu_emissions.xml'


pathBIO = os.path.join(PATH, FILENAME_BIO)
pathXML = os.path.join(PATH, FILENAME_XML)

with open(pathBIO, 'w', encoding='utf8', newline='') as bioFH:
    bioWriter = csv.writer(bioFH, quoting=csv.QUOTE_MINIMAL)
    bioWriter.writerow('Year Gas Value Technology Crop Country'.split())

    for event, elem in etree.iterparse(pathXML, events=('start',)):
        if elem.tag == 'region':
            str1 = elem.attrib['name']
        elif elem.tag == 'AgSupplySector':
            str2 = elem.attrib['name']
        elif elem.tag == 'AgProductionTechnology':
            str3 = elem.attrib['name']
        elif elem.tag == 'period':
            str4 = elem.attrib['year']
        elif elem.tag == 'Non-CO2':
            str5 = elem.attrib['name']
        elif elem.tag == 'input-emissions':
            str6 = elem.text
            bioWriter.writerow([str4, str5, str6, str3, str2, str1])
        elem.clear()

I made a few other small changes, since I assume you're using Python 3. They include using xml.etree.ElementTree instead of the deprecated xml.etree.cElementTree (removed in Python 3.9), skipping the codecs module (the built-in open() accepts an encoding argument in Python 3), and passing newline='' to the open() call so that the csv module can handle newlines correctly by itself.
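As a small illustration of the newline='' point: csv.writer ends each row with '\r\n' by default, and opening the file with newline='' stops Python from translating those line endings a second time (which on Windows would otherwise produce '\r\r\n' and show up as blank rows). The scratch file path here is hypothetical and only used for the demonstration:

```python
import csv
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# newline='' leaves line endings entirely to the csv module.
with open(path, "w", encoding="utf8", newline="") as fh:
    writer = csv.writer(fh, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["Year", "Gas"])
    writer.writerow(["1975", "SO2_1_AWB"])

# Read the raw bytes back to see exactly what was written.
with open(path, "rb") as fh:
    raw = fh.read()

print(raw)  # b'Year,Gas\r\n1975,SO2_1_AWB\r\n'
```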

Since listening to the start event is enough for the desired effect, I've dropped handling the end event entirely.

The result is

Year,Gas,Value,Technology,Crop,Country
1975,SO2_1_AWB,3.98749e-05,Corn_NelsonR,Corn,USA
1975,NOx_AWB,0.000285263,Corn_NelsonR,Corn,USA
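One caveat on relying on the start event alone: the ElementTree documentation notes that when iterparse() emits a start event, only the element's attributes are guaranteed to be available, while text and tail are undefined at that point. Reading elem.text on start can happen to work on small inputs, but it may not hold for a large file that is parsed in chunks. A more conservative sketch reads attributes on start, reads the text on end, and clears each element once it is complete. The names SAMPLE, CONTEXT and iter_rows are made up for this example, and the sample is a trimmed copy of the XML from the question:

```python
import io
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML (illustrative only).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<scenario><world><region name="USA">
<AgSupplySector name="Corn" nocreate="1">
<AgSupplySubsector name="Corn_NelsonR" nocreate="1">
<AgProductionTechnology name="Corn_NelsonR" nocreate="1">
<period year="1975">
<Non-CO2 name="SO2_1_AWB"><input-emissions>3.98749e-05</input-emissions></Non-CO2>
<Non-CO2 name="NOx_AWB"><input-emissions>0.000285263</input-emissions></Non-CO2>
</period>
</AgProductionTechnology></AgSupplySubsector></AgSupplySector>
</region></world></scenario>"""

# tag -> (output column, attribute that holds the value)
CONTEXT = {
    "region": ("Country", "name"),
    "AgSupplySector": ("Crop", "name"),
    "AgProductionTechnology": ("Technology", "name"),
    "period": ("Year", "year"),
    "Non-CO2": ("Gas", "name"),
}


def iter_rows(source):
    """Yield one row per input-emissions element found in source."""
    ctx = {}
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # Attributes are safe to read as soon as the start tag is seen.
            if elem.tag in CONTEXT:
                column, attr = CONTEXT[elem.tag]
                ctx[column] = elem.attrib[attr]
        else:
            # On 'end' the element is complete, so its text is available.
            if elem.tag == "input-emissions":
                yield [ctx["Year"], ctx["Gas"], elem.text,
                       ctx["Technology"], ctx["Crop"], ctx["Country"]]
            # Clear only finished elements to release their content.
            elem.clear()


rows = list(iter_rows(io.BytesIO(SAMPLE)))
for row in rows:
    print(row)
# ['1975', 'SO2_1_AWB', '3.98749e-05', 'Corn_NelsonR', 'Corn', 'USA']
# ['1975', 'NOx_AWB', '0.000285263', 'Corn_NelsonR', 'Corn', 'USA']
```

Each yielded row can be passed to bioWriter.writerow() in place of the inline loop; iterparse() accepts a file path as well as a file object, so iter_rows(pathXML) should work the same way.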

Upvotes: 2
