Dmitry

Reputation: 113

How to get inner content as string using minidom from xml.dom?

I have some text tags in my XML file (a PDF converted to XML using pdftohtml from poppler-utils) that look like this:

<text top="525" left="170" width="603" height="16" font="1">..part of old large book</text>
<text top="546" left="128" width="645" height="16" font="1">with many many pages and some <i>italics text among 'plain' text</i> and more and more text</text>
<text top="566" left="128" width="642" height="16" font="1">etc...</text>

and I can get the text enclosed in a text tag with this sample code:

from xml.dom import minidom

xmldoc = minidom.parse('../test/text.xml')
itemlist = xmldoc.getElementsByTagName('text')

node_index = 0  # example value: index of the <text> element to inspect
some_tag = itemlist[node_index]
output_text = some_tag.firstChild.nodeValue
# if all of the text is inside <i>, I can get it with
output_text = some_tag.firstChild.firstChild.nodeValue

# but not if <i></i> wraps only part of the string

but I cannot get nodeValue when the element contains another tag (<i>, <b>, ...) inside it, and I cannot get the object either.

What is the best way to get all the text as a plain string, like JavaScript's innerHTML property, or to recurse into child tags even when they wrap only some of the words rather than the entire nodeValue?

thanks

Upvotes: 2

Views: 5748

Answers (2)

Mark Manyen

Reputation: 91

Way too late to the party... I had a similar problem except I wanted the tags in the resulting string. Here is my solution:

from xml.dom import minidom

# Reconstruct this element's body XML from DOM nodes
def getChildXML(elem):
    out = ""
    for c in elem.childNodes:
        if c.nodeType == minidom.Node.TEXT_NODE:
            out += c.nodeValue
        elif c.nodeType == minidom.Node.ELEMENT_NODE:
            if c.childNodes.length == 0:
                # Empty element: emit a self-closing tag
                out += "<" + c.nodeName + "/>"
            else:
                # Wrap the recursively rebuilt children in open/close tags
                out += "<" + c.nodeName + ">"
                out += getChildXML(c)
                out += "</" + c.nodeName + ">"
    return out

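This returns the element's inner XML with the child tags included. Note that it drops any attributes on the child tags and does not re-escape special characters such as & or < in the text.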
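If attributes and escaping matter, one possible alternative is to let minidom serialize each child node itself with toxml() and join the results. This is only a sketch: the name inner_xml and the sample string are made up for illustration, and the exact serialization details (for example attribute quoting) are whatever minidom produces.

```python
from xml.dom import minidom


def inner_xml(elem):
    # Serialize every child node with minidom's own writer;
    # this keeps attributes and re-escapes text such as "&".
    return "".join(child.toxml() for child in elem.childNodes)


# Hypothetical sample input, not taken from the question's file
doc = minidom.parseString(
    '<text top="546">some <i class="x">italic &amp; more</i> text<br/></text>'
)
result = inner_xml(doc.documentElement)
print(result)
```

Because the children are serialized by the library, the string can be parsed again later, which is not guaranteed when tags are concatenated by hand.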

Upvotes: 2

stovfl

Reputation: 15533

**Question**: How to get inner content as string using minidom

This is a recursive solution, for instance:

from xml.dom import minidom


def getText(nodelist):
    # Collect the data of every TEXT_NODE, descending into child elements
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
        else:
            # Recurse into the children of non-text nodes (e.g. <i>, <b>)
            rc.append(getText(node.childNodes))
    return ''.join(rc)


xmldoc = minidom.parse('../test/text.xml')
nodelist = xmldoc.getElementsByTagName('text')

# Iterate <text ..>...</text> Node List
for node in nodelist:
    print(getText(node.childNodes))

Output:

..part of old large book
with many many pages and some italics text among 'plain' text and more and more text
etc...

Tested with Python: 3.4.2
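To try the same idea without the file on disk, here is a self-contained sketch that parses the three lines from the question out of a string. The <page> wrapper element and the names SAMPLE and iter_text are assumptions added for the example; the generator also treats CDATA sections as text, which the function above does not.

```python
from xml.dom import minidom

# The question's sample lines, wrapped in a made-up <page> root element
SAMPLE = """<page>
<text top="525" left="170" width="603" height="16" font="1">..part of old large book</text>
<text top="546" left="128" width="645" height="16" font="1">with many many pages and some <i>italics text among 'plain' text</i> and more and more text</text>
<text top="566" left="128" width="642" height="16" font="1">etc...</text>
</page>"""


def iter_text(node):
    # Yield the text pieces under node in document order
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            yield child.data
        else:
            # Descend into elements such as <i> or <b>
            yield from iter_text(child)


doc = minidom.parseString(SAMPLE)
lines = ["".join(iter_text(t)) for t in doc.getElementsByTagName("text")]
for line in lines:
    print(line)
```

Note that `yield from` requires Python 3.3 or later.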

Upvotes: 2
