How to get inner content as string using minidom from xml.dom?

Question

I have some text tags in my XML file (a PDF converted to XML using pdftohtml from poppler-utils) that look like this:

..part of old large book
with many many pages and some italics text among 'plain' text and more and more text
etc...

and I can get the text enclosed in a text tag with this sample code:

import string
from xml.dom import minidom
xmldoc = minidom.parse('../test/text.xml')
itemlist = xmldoc.getElementsByTagName('text')

some_tag = itemlist[node_index]
output_text = some_tag.firstChild.nodeValue
# if all of the text is wrapped in a nested tag, I can get it with
output_text = some_tag.firstChild.firstChild.nodeValue

# but not if the nested tag wraps only one word of the string

but I cannot get nodeValue when the text tag contains another tag (one marking italics, for example), and I cannot get the nested object either.
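
A minimal reproduction of the problem, assuming a hypothetical one-line sample in which an <i> tag wraps a single word (the sample string and the tag name are assumptions for illustration, not taken from the real file):

```python
from xml.dom import minidom

# Hypothetical sample; the <i> tag and the text are assumptions for illustration.
doc = minidom.parseString("<text>some <i>italics</i> text</text>")
tag = doc.getElementsByTagName("text")[0]

first = tag.firstChild        # text node holding only the leading "some "
second = tag.childNodes[1]    # the nested <i> element node

print(repr(first.nodeValue))        # 'some '
print(second.nodeValue)             # None: element nodes carry no nodeValue
print(second.firstChild.nodeValue)  # italics
```

So firstChild.nodeValue only ever returns the first fragment, and the nested element itself has no nodeValue at all.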

What is the best way to get all the text as a plain string, like JavaScript's innerHTML property, or to recurse into child tags even if they wrap only some of the words and not the entire nodeValue?

thanks

stovfl · Accepted Answer

Question: How to get inner content as string using minidom

This is a recursive solution, for instance:

from xml.dom import minidom


def getText(nodelist):
    # Walk all nodes and collect the data of every TEXT_NODE
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
        else:
            # Not a text node: recurse into its child nodes
            rc.append(getText(node.childNodes))
    return ''.join(rc)


xmldoc = minidom.parse('../test/text.xml')
nodelist = xmldoc.getElementsByTagName('text')

# Iterate ... Node List
for node in nodelist:
    print(getText(node.childNodes))

Output:

..part of old large book
with many many pages and some italics text among 'plain' text and more and more text
etc...

Tested with Python: 3.4.2
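
For anyone who wants to try this without the original file, here is a self-contained sketch of the same recursion, plus a variant that keeps the nested tags, which is closer to what innerHTML returns. The names inner_text and inner_xml and the sample string with its <page>, <i> and <b> tags are assumptions made up for illustration; unlike getText above, inner_text also collects CDATA sections.

```python
from xml.dom import minidom


def inner_text(node):
    """Concatenate every text node below `node`, ignoring the markup."""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            # Nested tag: recurse and collect its text as well
            parts.append(inner_text(child))
    return "".join(parts)


def inner_xml(node):
    """Serialize the children of `node`, keeping nested tags (innerHTML-like)."""
    return "".join(child.toxml() for child in node.childNodes)


# Hypothetical sample standing in for the real pdftohtml output.
sample = "<page><text>some <i>italics</i> text among <b>plain</b> text</text></page>"
doc = minidom.parseString(sample)

for tag in doc.getElementsByTagName("text"):
    print(inner_text(tag))  # some italics text among plain text
    print(inner_xml(tag))   # some <i>italics</i> text among <b>plain</b> text
```

Use inner_text when only the words matter, and inner_xml when the italic or bold markup has to survive.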
