Arnab

Reputation: 1037

Parsing XML in Python using the cElementTree module

I have an XML file that I want to convert to a dictionary. I wrote the following code, but the output is not what I expected. The XML file is named core-site.xml:

<configuration>
    <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hdfs/tmp</value>
    <description>Temporary Directory.</description>
    </property>

    <property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.XXX.X.XXX:XXXX</value>
    <description>Use HDFS as file storage engine</description>
    </property>
</configuration>

The code that I wrote is:

import xml.etree.cElementTree
import xml.etree.ElementTree as ET
import warnings

warnings.filterwarnings("ignore")

class XmlListConfig(list):
    def __init__(self, aList):
        for element in aList:
            if element:
                # treat like dict
                if len(element) == 1 or element[0].tag != element[1].tag:
                    self.append(XmlDictConfig(element))
                # treat like list
                elif element[0].tag == element[1].tag:
                    self.append(XmlListConfig(element))
            elif element.text:
                text = element.text.strip()
                if text:
                    self.append(text)


class XmlDictConfig(dict):
    def __init__(self, parent_element):
        if parent_element.items():
            self.update(dict(parent_element.items()))
        for element in parent_element:
            if element:
                # treat like dict - we assume that if the first two tags
                # in a series are different, then they are all different.
                if len(element) == 1 or element[0].tag != element[1].tag:
                    aDict = XmlDictConfig(element)
                # treat like list - we assume that if the first two tags
                # in a series are the same, then the rest are the same.
                else:
                    # here, we put the list in dictionary; the key is the
                    # tag name the list elements all share in common, and
                    # the value is the list itself 
                    aDict = {element[0].tag: XmlListConfig(element)}
                # if the tag has attributes, add those to the dict
                if element.items():
                    aDict.update(dict(element.items()))
                self.update({element.tag: aDict})
            # this assumes that if you've got an attribute in a tag,
            # you won't be having any text. This may or may not be a 
            # good idea -- time will tell. It works for the way we are
            # currently doing XML configuration files...
            elif element.items():
                self.update({element.tag: dict(element.items())})
            # finally, if there are no child tags and no attributes, extract
            # the text
            else:
                self.update({element.tag: element.text})

tree = ET.parse('core-site.xml')
root = tree.getroot()
xmldict = XmlDictConfig(root)
print xmldict

This is the output that I am getting:

{
    'property': 
    {
        'name': 'fs.defaultFS', 
        'value': 'hdfs://192.X.X.X:XXXX', 
        'description': 'Use HDFS as file storage engine'
    }
}

Why isn't the first property tag being shown? It only shows the data in the last property tag.

Upvotes: 0

Views: 448

Answers (1)

PW.

Reputation: 3725

XmlDictConfig is a dict keyed by tag name, so the second element stored under the key 'property' replaces the first one already recorded in the dict. That is why only the last property survives.
To keep both, you have to use a different data structure, for instance a list of dicts.
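A minimal sketch of the list-of-dicts idea, assuming the flat layout from the question, where every property holds only simple text children. The helper name properties_to_list and the by_name lookup are hypothetical, introduced only for illustration, and the XML is inlined as a string so the snippet runs on its own. It is written for Python 3, hence print() as a function.

```python
import xml.etree.ElementTree as ET

# Same structure as core-site.xml in the question, inlined for a self-contained example.
XML = """<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hdfs/tmp</value>
        <description>Temporary Directory.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.XXX.X.XXX:XXXX</value>
        <description>Use HDFS as file storage engine</description>
    </property>
</configuration>"""


def properties_to_list(root):
    """Return one dict per <property> element, in document order.

    Hypothetical helper: assumes each <property> has only text children.
    """
    return [
        {child.tag: (child.text or "").strip() for child in prop}
        for prop in root.findall("property")
    ]


root = ET.fromstring(XML)  # with a file: ET.parse('core-site.xml').getroot()
props = properties_to_list(root)

# Optional: property names are unique here, so they can serve as dict keys.
by_name = {p["name"]: p["value"] for p in props}

print(props)
print(by_name["hadoop.tmp.dir"])
```

Because the repeated property elements go into a list, neither overwrites the other. Keying a dict by the name text instead of the tag is one possible alternative when a lookup by property name is what you actually need.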

Upvotes: 2
