Reputation: 166
I am trying to parse a very ugly XML file with Python. I manage to get fairly far into it, but it fails at the npdoc element. What am I doing wrong?
XML:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<npexchange xmlns="http://www.example.com/npexchange/3.5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="3.5">
  <article id="123" refType="Article">
    <articleparts>
      <articlepart id="1234" refType="ArticlePart">
        <data>
          <npdoc xmlns="http://www.example.com/npdoc/2.1" version="2.1" xml:lang="sv_SE">
            <body>
              <p>Lorem ipsum some random text here.</p>
              <p>
                <b>Yes this is HTML markup, and I would like to keep that.</b>
              </p>
            </body>
            <headline>
              <p>I am a headline</p>
            </headline>
            <leadin>
              <p>I am some other text</p>
            </leadin>
          </npdoc>
        </data>
      </articlepart>
    </articleparts>
  </article>
</npexchange>
This is the Python code I have so far:
from xml.etree.ElementTree import ElementTree

def parse(self):
    tree = ElementTree(file=filename)
    for item in tree.iter("article"):
        articleParts = item.find("articleparts")
        for articlepart in articleParts.iter("articlepart"):
            data = articlepart.find("data")
            npdoc = data.find("npdoc")
            id = item.get("id")
            headline = npdoc.find("headline").text
            leadIn = npdoc.find("leadin").text
            body = npdoc.find("body").text
    return articles
What happens is that I get the id out, but I cannot access the fields inside the npdoc element; the npdoc variable gets set to None.
Update: I managed to get the elements into variables by using the namespace in the .find() calls. How do I get the value out? Since the content is HTML, it does not come out correctly with the .text attribute.
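Roughly, what I have now looks like this (a minimal sketch; the prefix key is arbitrary, only the URI has to match the file):
ns = {'npdoc': 'http://www.example.com/npdoc/2.1'}
tree = ElementTree(file=filename)
for npdoc in tree.iterfind('.//npdoc:npdoc', ns):
    headline = npdoc.find('npdoc:headline', ns)  # an Element now, not None
    print(headline.text)                         # only whitespace; the <p> markup is in child elements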
Upvotes: 2
Views: 2487
Reputation: 123829
This is what I came up with in Python 3.4. It's certainly not bulletproof, but it might give you some ideas.
import xml.etree.ElementTree as ET

tree = ET.parse(r'C:\Users\Gord\Desktop\nasty.xml')
npexchange = tree.getroot()
for article in npexchange:
    for articleparts in article:
        for articlepart in articleparts:
            id = articlepart.attrib['id']
            print("ArticlePart - id: {0}".format(id))
            for data in articlepart:
                for npdoc in data:
                    for child in npdoc:
                        tag = child.tag[child.tag.find('}')+1:]
                        print(" {0}:".format(tag))  ## e.g., "body:"
                        contents = ET.tostring(child).decode('utf-8')
                        contents = contents.replace('<ns0:', '<')
                        contents = contents.replace('</ns0:', '</')
                        contents = contents.replace(' xmlns:ns0="http://www.example.com/npdoc/2.1">', '>')
                        contents = contents.replace('<' + tag + '>\n', '')
                        contents = contents.replace('</' + tag + '>', '')
                        contents = contents.strip()
                        print(" {0}".format(contents))
The console output is
ArticlePart - id: 1234
body:
<p>Lorem ipsum some random text here.</p>
<p>
<b>Yes this is HTML markup, and I would like to keep that.</b>
</p>
headline:
<p>I am a headline</p>
leadin:
<p>I am some other text</p>
Update
Somewhat improved version, using register_namespace with an empty prefix to remove some namespace-prefix "noise", and .findall() instead of blindly iterating through child nodes regardless of their tag:
import xml.etree.ElementTree as ET
npdoc_uri = 'http://www.example.com/npdoc/2.1'
nsmap = {
    'npexchange': 'http://www.example.com/npexchange/3.5',
    'npdoc': npdoc_uri
}
ET.register_namespace("", npdoc_uri)

tree = ET.parse(r'/home/gord/Desktop/nasty.xml')
npexchange = tree.getroot()
for article in npexchange.findall('npexchange:article', nsmap):
    for articleparts in article.findall('npexchange:articleparts', nsmap):
        for articlepart in articleparts.findall('npexchange:articlepart', nsmap):
            id = articlepart.attrib['id']
            print("ArticlePart - id: {0}".format(id))
            for data in articlepart.findall('npexchange:data', nsmap):
                for npdoc in data.findall('npdoc:npdoc', nsmap):
                    for child in npdoc:
                        tag = child.tag[child.tag.find('}')+1:]
                        print(" {0}:".format(tag))  ## e.g., "body:"
                        contents = ET.tostring(child).decode('utf-8')
                        # remove HTML block tags, e.g. <body ...> and </body>
                        contents = contents.replace('<' + tag + ' xmlns="' + npdoc_uri + '">\n', '')
                        contents = contents.replace('</' + tag + '>', '')
                        contents = contents.strip()
                        print(" {0}".format(contents))
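With the empty prefix registered, ET.tostring() serializes each fragment with a plain xmlns="..." declaration instead of ns0: prefixes, so the two replace() calls above are all the cleanup needed and the printed output should be the same as in the first version.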
Upvotes: 1
Reputation: 295687
nsmap = {'npx': 'http://www.example.com/npexchange/3.5',
         'npdoc': 'http://www.example.com/npdoc/2.1'}
data = articlepart.find("npx:data", namespaces=nsmap)    # <data> inherits the npexchange default namespace
npdoc = data.find("npdoc:npdoc", namespaces=nsmap)       # <npdoc> and its children switch to the npdoc namespace
...will find your data and npdoc elements. No ugly, unreliable string munging required. (Re: "unreliable" -- consider what that would do to CDATA sections containing literal angle brackets.)
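For the follow-up question about keeping the embedded HTML, a rough sketch of the same namespace-aware approach (the file name below is a placeholder):
import xml.etree.ElementTree as ET

nsmap = {'npdoc': 'http://www.example.com/npdoc/2.1'}

tree = ET.parse('nasty.xml')   # placeholder path
for npdoc in tree.iterfind('.//npdoc:npdoc', nsmap):
    for name in ('headline', 'leadin', 'body'):
        elem = npdoc.find('npdoc:' + name, nsmap)
        # serialize the child elements to keep the embedded markup;
        # .text would only return the whitespace before the first <p>
        html = ''.join(ET.tostring(child, encoding='unicode') for child in elem)
        print(name, html.strip())
The serialized fragments still carry the npdoc namespace declaration (as ns0: prefixes by default); registering that URI with an empty prefix, as in the other answer's update, cuts down on that noise.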
Upvotes: 2