SaturnsBelt

Reputation: 281

How to parse XML elements into Python from a very large XML file?

I am currently working on a program made up of 20 or so scripts that are called from one Python file using the subprocess library. Each script takes 3 parameters, which the user must currently enter using argparse: the IP address, the username, and the password. These scripts automate the testing of networking devices and such.

Now, instead of having the user enter these parameters on the command line, I want to extract these values from an XML file of about 5,000 lines that my company has generated. What is the best way to extract the info I need so the user doesn't have to type in the parameters manually?

I have done some research, but I still can't tell which approach is best. Here is a sample excerpt of the XML file:

<sheet>
        <name>7_managementHosts</name>
        <data>
            <name>MgtHosts</name>
            <key>
                <name>Rack U-Location</name>
                <value>U30</value>
                <value>U29</value>
                <value>U28</value>
            </key>
            <key>
                <name>Default Component Name</name>
                <value>sms01</value>
                <value>sms02</value>
                <value>sms03</value>
            </key>
            <key>
                <name>DNS hostname (FQDN)</name>
                <value>sms01.de1000.local</value>
                <value>sms02.de1000.local</value>
                <value>sms03.de1000.local</value>
            </key>
            <key>
                <name>DNS suffix for management interface</name>
                <value>de1000.local</value>
                <value>de1000.local</value>
                <value>de1000.local</value>
            </key>
            <key>
                <name>Keyboard layout</name>
                <value>US Default</value>
                <value>US Default</value>
                <value>US Default</value>
            </key>
            <key>
                <name>root user password</name>
                <value>myPassword</value>
                <value>myPassword</value>
                <value>myPassword</value>
            </key>

It is a really long XML file, but the tree looks like this throughout, and I really don't know the best way to go about this. Thanks for the help!

Upvotes: 0

Views: 95

Answers (2)

balderman

Reputation: 23825

Using Python's standard XML library (and assuming you would like to collect the data under the 'key' elements):

import xml.etree.ElementTree as ET
import pprint

xml = '''<sheet>
        <name>7_managementHosts</name>
        <data>
            <name>MgtHosts</name>
            <key>
                <name>Rack U-Location</name>
                <value>U30</value>
                <value>U29</value>
                <value>U28</value>
            </key>
            <key>
                <name>Default Component Name</name>
                <value>sms01</value>
                <value>sms02</value>
                <value>sms03</value>
            </key>
            <key>
                <name>DNS hostname (FQDN)</name>
                <value>sms01.de1000.local</value>
                <value>sms02.de1000.local</value>
                <value>sms03.de1000.local</value>
            </key>
            <key>
                <name>DNS suffix for management interface</name>
                <value>de1000.local</value>
                <value>de1000.local</value>
                <value>de1000.local</value>
            </key>
            <key>
                <name>Keyboard layout</name>
                <value>US Default</value>
                <value>US Default</value>
                <value>US Default</value>
            </key>
            <key>
                <name>root user password</name>
                <value>myPassword</value>
                <value>myPassword</value>
                <value>myPassword</value>
            </key>
        </data>
    </sheet>'''

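# Map each key's <name> text to the list of its <value> texts
data = {}
root = ET.fromstring(xml)
# Find every <key> element that sits directly under a <data> element
keys = root.findall('.//data/key')
for key in keys:
    data[key.find('name').text] = [v.text for v in key.findall('value')]
pprint.pprint(data)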

Output:

{'DNS hostname (FQDN)': ['sms01.de1000.local',
                         'sms02.de1000.local',
                         'sms03.de1000.local'],
 'DNS suffix for management interface': ['de1000.local',
                                         'de1000.local',
                                         'de1000.local'],
 'Default Component Name': ['sms01', 'sms02', 'sms03'],
 'Keyboard layout': ['US Default', 'US Default', 'US Default'],
 'Rack U-Location': ['U30', 'U29', 'U28'],
 'root user password': ['myPassword', 'myPassword', 'myPassword']}
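
Since the question mentions reading from a large file rather than a string, here is a minimal sketch of the same collection step done with `ET.iterparse`, which reads the document incrementally. The function name `collect_keys` and the in-memory sample are assumptions made for illustration; a file path could be passed in place of the `BytesIO` object. For a file of about 5,000 lines, `ET.parse` on the whole file would most likely be fine as well.

```python
import io
import xml.etree.ElementTree as ET


def collect_keys(source):
    """Stream through the XML and return {key name: [values]}."""
    data = {}
    # With events=("end",), each element is yielded once its closing
    # tag has been read, so a <key> already has all of its children.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "key":
            name = elem.findtext("name")
            data[name] = [v.text for v in elem.findall("value")]
            elem.clear()  # drop the children we have already copied
    return data


# Small in-memory stand-in for the real file (hypothetical data,
# same layout as the excerpt in the question).
sample = io.BytesIO(b"""<sheet>
  <name>7_managementHosts</name>
  <data>
    <name>MgtHosts</name>
    <key>
      <name>DNS hostname (FQDN)</name>
      <value>sms01.de1000.local</value>
      <value>sms02.de1000.local</value>
    </key>
    <key>
      <name>root user password</name>
      <value>myPassword</value>
      <value>myPassword</value>
    </key>
  </data>
</sheet>""")

columns = collect_keys(sample)
print(columns)
```

The result has the same shape as `data` above: one list of values per key name, in document order.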

Upvotes: 1

Andrej Kesely

Reputation: 195573

Example with BeautifulSoup, just to get you started with the module:

data = '''
<sheet>
        <name>7_managementHosts</name>
        <data>
            <name>MgtHosts</name>
            <key>
                <name>Rack U-Location</name>
                <value>U30</value>
                <value>U29</value>
                <value>U28</value>
            </key>
            <key>
                <name>Default Component Name</name>
                <value>sms01</value>
                <value>sms02</value>
                <value>sms03</value>
            </key>
            <key>
                <name>DNS hostname (FQDN)</name>
                <value>sms01.de1000.local</value>
                <value>sms02.de1000.local</value>
                <value>sms03.de1000.local</value>
            </key>
            <key>
                <name>DNS suffix for management interface</name>
                <value>de1000.local</value>
                <value>de1000.local</value>
                <value>de1000.local</value>
            </key>
            <key>
                <name>Keyboard layout</name>
                <value>US Default</value>
                <value>US Default</value>
                <value>US Default</value>
            </key>
            <key>
                <name>root user password</name>
                <value>myPassword</value>
                <value>myPassword</value>
                <value>myPassword</value>
            </key>
 '''

from bs4 import BeautifulSoup

data = BeautifulSoup(data, 'lxml')

parsed = [[v.text for v in key.select('name, value')] for key in data.select('key')]

# just for pretty printing; all the data is in the `parsed` variable
from textwrap import shorten
for row_num, row in enumerate(zip(*parsed)):
    # row 0 holds the key names, so label it as the header row
    label = 'Row Number' if row_num == 0 else str(row_num)
    print(''.join('{: ^25}'.format(shorten(d, 25)) for d in [label] + list(row)))

Prints:

   Row Number             Rack U-Location      Default Component Name     DNS hostname (FQDN)     DNS suffix for [...]        Keyboard layout        root user password    
        1                       U30                     sms01             sms01.de1000.local          de1000.local              US Default               myPassword        
        2                       U29                     sms02             sms02.de1000.local          de1000.local              US Default               myPassword        
        3                       U28                     sms03             sms03.de1000.local          de1000.local              US Default               myPassword        
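
To connect this back to the scripts in the question, the rows can be turned into one dictionary per host and then into argument lists for `subprocess`. This is only a sketch under several assumptions: the `parsed` literal below is a hand-written stand-in with the same shape the list comprehension above produces, the script name `check_device.py` is hypothetical, the scripts are assumed to take host, username and password as positional arguments, the FQDN stands in for the IP address (the excerpt has no IP column), and the username `root` is guessed from the "root user password" key.

```python
import sys

# Stand-in for `parsed`: one list per <key>, name first, values after.
parsed = [
    ["Default Component Name", "sms01", "sms02", "sms03"],
    ["DNS hostname (FQDN)", "sms01.de1000.local",
     "sms02.de1000.local", "sms03.de1000.local"],
    ["root user password", "myPassword", "myPassword", "myPassword"],
]

# Transpose: the first tuple holds the key names, the rest are hosts.
header, *rows = zip(*parsed)
hosts = [dict(zip(header, row)) for row in rows]


def build_command(script, host, username="root"):
    """Build an argument list for one script and one host (assumed CLI)."""
    return [
        sys.executable,
        script,
        host["DNS hostname (FQDN)"],
        username,
        host["root user password"],
    ]


# "check_device.py" is a hypothetical script name.
commands = [build_command("check_device.py", h) for h in hosts]
for cmd in commands:
    print(cmd[1:4])  # show script, host and username only
```

Each entry in `commands` could then be handed to `subprocess.run` in place of the values the user currently types in.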

Upvotes: 0
