Abhishek Kulkarni

Reputation: 3818

XML Parsing issue in python using xml.etree.ElementTree

I have the following XML, generated by an HTTP response:

<?xml version="1.0" encoding="UTF-8"?>
<Response rid="1000" status="succeeded" moreData="false">
  <Results completed="true" total="25" matched="5" processed="25">
      <Resource type="h" DisplayName="Host" name="tango">
          <Time start="2011/12/16/18/46/00" end="2011/12/16/19/46/00"/>
             <PerfData attrId="cpuUsage" attrName="Usage">
                <Data intr="5" start="2011/12/16/19" end="2011/12/16/19" data="36.00"/>
                <Data intr="5" start="2011/12/16/19" end="2011/12/16/19" data="86.00"/>
                <Data intr="5" start="2011/12/16/19" end="2011/12/16/19" data="29.00"/>
             </PerfData>
          <Resource type="vm" DisplayName="VM" name="charlie" baseHost="tango">
              <Time start="2011/12/16/18/46/00" end="2011/12/16/19/46/00"/>
              <PerfData attrId="cpuUsage" attrName="Usage">
                 <Data intr="5" start="2011/12/16/19" end="2011/12/16/19" data="6.00"/>
              </PerfData>
          </Resource>
      </Resource>
  </Results>
</Response>

If you look at this carefully, the outer <Resource> element has another <Resource> element nested inside it.

So the high-level XML structure is as follows:

<Resource>
    <Resource>
    </Resource>
</Resource>

Python's ElementTree seems to parse only the outer <Resource> element. Below is my code:

import re
import xml.etree.ElementTree as xml

pattern = re.compile(r'(<Response.*?</Response>)',
                     re.VERBOSE | re.MULTILINE)

for match in pattern.finditer(data):
    contents = match.group(1)
    responses = xml.fromstring(contents)

    for results in responses:
        result = results.tag

        for resources in results:
            resource = resources.tag
            temp = {}
            temp = resources.attrib
            print temp

This prints the following output (temp):

{'DisplayName': 'Host', 'type': 'h', 'name': 'tango'}

How can I fetch the attributes of the inner <Resource> element?

Upvotes: 0

Views: 296

Answers (1)

Guillaume

Reputation: 10961

Don't parse XML with regular expressions! That won't work. Use an XML parsing library instead, lxml for instance:

Edit: the code example now fetches the top-level resources only, then loops over them and tries to fetch their "sub-resources". This change was made after the OP's request in the comments.

from lxml import etree

content = '''
YOUR XML HERE
'''

root = etree.fromstring(content)

# search for all "top level" resources
resources = root.xpath("//Resource[not(ancestor::Resource)]")
for resource in resources:
    # copy resource attributes in a dict
    mashup = dict(resource.attrib)
    # find child resource elements
    subresources = resource.xpath("./Resource")
    # if we find only one resource, add it to the mashup
    if len(subresources) == 1:
        mashup['resource'] = dict(subresources[0].attrib)
    # else... no idea what the OP wants

    print mashup

That will output:

{'resource': {'DisplayName': 'VM', 'type': 'vm', 'name': 'charlie', 'baseHost': 'tango'}, 'DisplayName': 'Host', 'type': 'h', 'name': 'tango'}
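
Since the question asks about xml.etree.ElementTree specifically, here is a minimal sketch of the same idea using only the standard library, written for Python 3. The helper name resource_to_dict and the 'resources' key are made up for illustration, and the sample is a trimmed copy of the XML from the question. The original loop stops at the outer element because it only descends two levels (Response, Results, Resource); iter() or a recursive helper reaches the nested ones.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question.
SAMPLE = """<Response rid="1000" status="succeeded" moreData="false">
  <Results completed="true" total="25" matched="5" processed="25">
    <Resource type="h" DisplayName="Host" name="tango">
      <Resource type="vm" DisplayName="VM" name="charlie" baseHost="tango"/>
    </Resource>
  </Results>
</Response>"""


def resource_to_dict(elem):
    """Return elem's attributes, plus a 'resources' list of nested <Resource> children.

    The 'resources' key is a hypothetical choice; pick another name if the
    real data has an attribute called 'resources'.
    """
    info = dict(elem.attrib)
    # findall("Resource") matches direct children only, so recurse for deeper levels.
    children = [resource_to_dict(child) for child in elem.findall("Resource")]
    if children:
        info["resources"] = children
    return info


root = ET.fromstring(SAMPLE)

# Flat view: iter() walks the whole tree, so nested resources are included.
names = [res.get("name") for res in root.iter("Resource")]
print(names)

# Nested view: start from the top-level resources under <Results>.
top_level = [resource_to_dict(res) for res in root.findall("./Results/Resource")]
print(top_level)
```

The flat view is enough if you only need every resource's attributes; the nested view keeps the host/VM relationship and works for any nesting depth.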

Upvotes: 2
