In Python, Parsing Custom XML Tags Without Parsing HTML

Question

I'm new to Python 2.7, and I'm trying to parse an XML file that contains HTML. I want to parse the custom XML tags without parsing any HTML content whatsoever. What's the best way to do this? (If it's helpful, my list of custom XML tags is small, so if there's an XML parser that has an option to only parse specified tags that would probably work fine.)

For example, I have an XML file that looks like this:

<myTag1>
  <myTag2>
    <p>My what a lovely day.</p>
  </myTag2>
</myTag1>

I'd like to be able to parse apart everything except the HTML, and in particular to extract the value of myTag2 as un-parsed HTML.

EDIT: Here's more information to answer a question below. I had previously tried using ElementTree. This is what happened:

import xml.etree.ElementTree as ET

root = ET.fromstring(xmlstring)
root.tag  # returns 'myTag1'
root[0].tag  # returns 'myTag2'
root[0].text  # returns None, but I want it to return the HTML string

The HTML string I want has been parsed and is stored as a tag and text:

root[0][0].tag  # returns 'p', but I don't even want root[0][0] to exist
root[0][0].text  # returns 'My ... day.'

But really I'd like to be able to do something like this...

root[0].unparsedtext  # would return '<p>My ... day.</p>'

SOLUTION:

har07's answer works great. I modified that code slightly to account for an edge case. Here's what I'm implementing:

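def _getInner(element):
    # element.text is None when the element begins with a child tag
    if element.text is None:
        textStr = ''
    else:
        textStr = element.text
    # ET.tostring(e) also includes e's tail, the text after its closing tag
    return textStr + ''.join(ET.tostring(e) for e in element)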

Then if

element = ET.fromstring('<myTag>Let us be <b>gratuitous</b> with tags</myTag>')

the original code returns only the content starting at the first child tag, but the modified version also captures the leading text:

''.join(ET.tostring(e) for e in element)  # returns '<b>gratuitous</b> with tags'

_getInner(element)  # returns 'Let us be <b>gratuitous</b> with tags'
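For anyone on Python 3, here is a minimal sketch of the same helper. It assumes Python 3, where ET.tostring returns bytes unless encoding='unicode' is passed. The name get_inner and the myTag and b tag names in the sample string are only illustrative, not taken from the original code.

```python
import xml.etree.ElementTree as ET

def get_inner(element):
    """Return an element's content (leading text plus serialized children) as a str."""
    # element.text is None when the element starts directly with a child tag.
    leading_text = element.text or ''
    # encoding='unicode' makes ET.tostring return str instead of bytes on Python 3.
    # Each child's serialization includes its tail (text after its closing tag).
    return leading_text + ''.join(
        ET.tostring(child, encoding='unicode') for child in element
    )

# Illustrative input: the wrapper and <b> tag names are assumptions.
element = ET.fromstring('<myTag>Let us be <b>gratuitous</b> with tags</myTag>')
inner = get_inner(element)
print(inner)  # prints: Let us be <b>gratuitous</b> with tags
```

The `element.text or ''` expression does the same job as the explicit None check, and it also treats an empty string the same way.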

har07 · Accepted Answer

I don't think there is an easy way to modify an XML parser's behavior so that it ignores some predefined tags. A much easier way is to let the parser parse the XML normally, then write a function that returns the unparsed content of an element, for example:

import xml.etree.ElementTree as ET

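def getUnparsedContent(element):
    # Re-serialize each child element (tags, text, and tail) and concatenate.
    # On Python 3, ET.tostring returns bytes unless encoding='unicode' is passed.
    return ''.join(ET.tostring(e) for e in element)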

xmlstring = """<myTag1>
  <myTag2>
    <p>My what a lovely day.</p>
  </myTag2>
</myTag1>"""

root = ET.fromstring(xmlstring)
print(getUnparsedContent(root[0]))

Output:

<p>My what a lovely day.</p>
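One detail worth seeing in action: because ET.tostring includes each child's tail, the returned string keeps the whitespace that followed the closing tag in the source. The sketch below assumes Python 3 and assumes the sample document has the myTag1 > myTag2 > p structure described in the question. The names get_unparsed_content, raw, and html are illustrative.

```python
import xml.etree.ElementTree as ET

def get_unparsed_content(element):
    # Serialize each child back to markup; encoding='unicode' yields str on Python 3.
    return ''.join(ET.tostring(child, encoding='unicode') for child in element)

# Assumed sample document, following the structure described in the question.
xmlstring = """<myTag1>
  <myTag2>
    <p>My what a lovely day.</p>
  </myTag2>
</myTag1>"""

root = ET.fromstring(xmlstring)
raw = get_unparsed_content(root[0])
print(repr(raw))   # prints: '<p>My what a lovely day.</p>\n  '

# Strip the surrounding whitespace if only the markup itself is wanted.
html = raw.strip()
print(html)        # prints: <p>My what a lovely day.</p>
```

Note that the result is a re-serialization of the parsed tree, so it should be equivalent to the embedded markup but is not guaranteed to match the source text character for character.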
