Cheng Shuai

Reputation: 13

Extract items from an XML file and convert them to a dict in Python

There is a file called core-site.xml:

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/centos/hadoop_tmp/tmp</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://test:9000</value>
    </property>
</configuration>

How can I get a dict in Python like this:

{'hadoop.tmp.dir': 'file:/home/centos/hadoop_tmp/tmp', 'fs.defaultFS': 'hdfs://test:9000'}

Upvotes: 0

Views: 1143

Answers (2)

BoboDarph

Reputation: 2891

The question already has an accepted answer, but since I commented on it, I wanted to give an example of how to use one of the modules I suggested.

xml = '''<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/centos/hadoop_tmp/tmp</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://test:9000</value>
    </property>
</configuration>'''

import xmltodict
# Load the XML string into a nested dict-like structure
test = xmltodict.parse(xml)
# Dictionary that will hold the name -> value pairs
temp_dict = {}
# Walk the top-level elements (here just 'configuration')
for name in test:
    # Skip elements that have no 'property' children
    if isinstance(test[name], dict) and 'property' in test[name]:
        properties = test[name]['property']
        # xmltodict returns a single dict, not a list, when there is only one <property>
        if not isinstance(properties, list):
            properties = [properties]
        for prop in properties:
            # Print whatever this <property> contains
            if 'name' in prop:
                print('Found name', prop['name'])
            if 'value' in prop:
                print('With value', prop['value'])
            # Save the pair; if a "name" is duplicated, only the last "value" is kept
            if 'name' in prop and 'value' in prop:
                temp_dict[prop['name']] = prop['value']

print(temp_dict)

And here's the output:

Found name hadoop.tmp.dir
With value file:/home/centos/hadoop_tmp/tmp
Found name fs.defaultFS
With value hdfs://test:9000
{'hadoop.tmp.dir': 'file:/home/centos/hadoop_tmp/tmp', 'fs.defaultFS': 'hdfs://test:9000'}

Upvotes: 0

sktan

Reputation: 1259

You can use the ElementTree module from Python's standard library, which is documented here: https://docs.python.org/2/library/xml.etree.elementtree.html

First, parse the .xml file with ElementTree and get its root element:

import xml.etree.ElementTree as ET
tree = ET.parse('core-site.xml')
root = tree.getroot()

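Once that is done, you can use the root object to walk the XML document. Each <property> element is a direct child of the root, so findall('property') returns all of them:

for entry in root.findall('property'):
    ...  # process each <property> element here

Within this loop, you can extract the name and value of each property: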

for entry in root.findall('property'):
    name = entry.find('name').text
    value = entry.find('value').text
    print(name)
    print(value)

To collect these into a dictionary, assign each value under its name:

configuration = dict()
for entry in root.findall('property'):
    name = entry.find('name').text
    value = entry.find('value').text
    configuration[name] = value

Putting it all together, you end up with a dictionary containing every property from the XML file:

import xml.etree.ElementTree as ET
tree = ET.parse('core-site.xml')
root = tree.getroot()
configuration = dict()
for entry in root.findall('property'):
    name = entry.find('name').text
    value = entry.find('value').text
    configuration[name] = value
print(configuration)
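
If some <property> elements might be missing a <name> or <value>, the loop above would fail on `.find(...).text`. As a rough sketch of one way to guard against that, the version below uses `findtext`, which returns None rather than raising when the child is absent. The helper name `properties_to_dict` and the inline `SAMPLE_XML` string are assumptions made for illustration; with a real file you would pass `ET.parse('core-site.xml').getroot()` instead.

```python
import xml.etree.ElementTree as ET

# Sample document copied from the question, inlined so the sketch is self-contained.
SAMPLE_XML = """<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/centos/hadoop_tmp/tmp</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://test:9000</value>
    </property>
</configuration>"""


def properties_to_dict(root):
    """Hypothetical helper: map each <property>'s <name> to its <value>."""
    config = {}
    for entry in root.findall('property'):
        # findtext returns None when the child element does not exist
        name = entry.findtext('name')
        if name is None:
            continue  # skip properties that have no <name>
        # A missing or empty <value> is stored as an empty string
        config[name.strip()] = (entry.findtext('value') or '').strip()
    return config


configuration = properties_to_dict(ET.fromstring(SAMPLE_XML))
print(configuration)
```

As in the loop above, a duplicated name keeps only the last value, since later assignments overwrite earlier ones.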

Upvotes: 2
