SgtMac02

Reputation: 21

Python issue using xml.dom.minidom Document. Extra empty lines between child elements using toprettyxml()

Please bear with me, as I'm VERY new to Python (and the greater programming community), but I'm being guided by a coworker with more experience than I have. We're trying to write a Python script that reads in an XML file, picks apart certain parts of the data, edits some of the variable values, and then reassembles the XML. The problem we're running into is the way the data is formatted when it is written back out to the new document using toprettyxml().

Basically, the top half of the file has a bunch of elements that we don't need to modify at all, so we're trying to grab those elements in their entirety and append them back onto the root when we reassemble. Some of the lower elements in the same file at the same level are picked apart into smaller items in memory and reassembled at the lowest child levels. The ones that are manually reassembled and appended work fine.

So here's what should roughly be the relevant bits of code:

def __handleElemsWithAtrributes(elem):
    #returns empty element with all attributes of source element
    tmpDoc = Document()
    result = tmpDoc.createElement(elem.item(0).tagName)
    attr_map = elem.item(0).attributes
    for i in range(attr_map.length):
        result.setAttribute(attr_map.item(i).name,attr_map.item(i).value)
    return result

def __getWholeElement(elems):
    #returns element with all attributes of source element and all contents
    if len(elems) == 0:
        return 0
    temp = Document()
    for e in elems:
        result = temp.createElement(e.tagName)
        attr_map = e.attributes
        for i in range(attr_map.length):
            result.setAttribute(attr_map.item(i).name,attr_map.item(i).value)
        result = e
    return result


def __init__():
      ##A bunch of other stuff I'm leaving out...
                f = xml.dom.minidom.parse(pathToFile)
                doc = Document()

                modules = f.getElementsByTagName("Module")
                descriptions = f.getElementsByTagName("Description")
                steptree = f.getElementsByTagName("StepTree")
                reference = f.getElementsByTagName("LessonReference")

                mod_val = __handleElemsWithAtrributes(modules)
                des_val = __getWholeElement(descriptions)
                step_val = __getWholeElement(steptree)
                ref_val = __getWholeElement(reference)

                if des_val != 0 and mod_val != 0 and step_val != 0 and ref_val != 0:
                    mod_val.appendChild(des_val)
                    mod_val.appendChild(step_val)
                    mod_val.appendChild(ref_val)
                    doc.appendChild(mod_val)
                o.write(doc.toprettyxml())

No, the tabbing is not accurately preserved here because I copied from several different areas, but I'm sure you get the gist.

Basically, the input I am using looks something like this:

<Module aatribute="" attribte2="" attribute3="" >
<Description>
    <Title>SomeTitle</Title>
    <Objective>An objective</Objective>
    <Action>
        <Familiarize>familiarize text</Familiarize>
    </Action>
    <Condition>
        <Familiarize>Condition text</Familiarize>
    </Condition>
    <Standard>
        <Familiarize>Standard text</Familiarize>
    </Standard>
    <PerformanceMeasures>
        <Measure>COL text</Measure>
    </PerformanceMeasures>
    <TMReferences>
        <Reference>Reference text</Reference> 
    </TMReferences>
</Description>

And then when it's reassembled, it comes out looking something like this:

<Module aatribute="" attribte2="" attribute3="" >
<Description>


    <Title>SomeTitle</Title>


    <Objective>An objective</Objective>


    <Action>


        <Familiarize>familiarize text</Familiarize>


    </Action>


    <Condition>


        <Familiarize>Condition text</Familiarize>


    </Condition>


    <Standard>


        <Familiarize>Standard text</Familiarize>


    </Standard>


    <PerformanceMeasures>


        <Measure>COL text</Measure>


    </PerformanceMeasures>


    <TMReferences>


        <Reference>Reference text</Reference> 


    </TMReferences>


</Description>

How do I get it to stop making all of the extra empty lines? Any ideas?

Upvotes: 2

Views: 2368

Answers (2)

frank schmidt

Reputation: 121

Thank you! This version works recursively:

def cleanUpNodes(self, nodes):
    for node in nodes.childNodes:
        # Blank out text nodes that contain nothing but whitespace.
        if node.nodeType == node.TEXT_NODE and not node.data.strip():
            node.data = ''
        # Recurse into child elements.
        if node.nodeType == node.ELEMENT_NODE:
            self.cleanUpNodes(node)
    # normalize() then discards the emptied text nodes.
    nodes.normalize()
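
For reference, here is a minimal sketch of how a cleanup like this can be used end to end. It assumes a standalone function rather than a method and a small inline sample modelled on the question's XML; the names clean_up_nodes, SAMPLE and count_blank_lines are made up for illustration and are not from the original script.

```python
from xml.dom.minidom import parseString

# Hypothetical sample, modelled on the XML in the question.
SAMPLE = """<Description>
    <Title>SomeTitle</Title>
    <Action>
        <Familiarize>familiarize text</Familiarize>
    </Action>
</Description>"""


def clean_up_nodes(parent):
    """Blank whitespace-only text nodes below parent, then drop them."""
    for node in parent.childNodes:
        if node.nodeType == node.TEXT_NODE and not node.data.strip():
            node.data = ''
        elif node.nodeType == node.ELEMENT_NODE:
            clean_up_nodes(node)
    # normalize() discards empty text nodes and merges adjacent ones.
    parent.normalize()


def count_blank_lines(text):
    """Count lines that are empty or contain only whitespace."""
    return sum(1 for line in text.splitlines() if not line.strip())


doc = parseString(SAMPLE)
before = doc.toprettyxml(indent="    ")
clean_up_nodes(doc.documentElement)
after = doc.toprettyxml(indent="    ")

print(count_blank_lines(before), "blank lines before cleanup")
print(count_blank_lines(after), "blank lines after cleanup")
```

Comparing the two counts is a quick way to check that the cleanup ran before toprettyxml() was called.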

Upvotes: -1

Lilley

Reputation: 234

I have the same issue. The thing is, every line break (and the indentation after it) in the file you parse ends up as a text node in your tree. Hence, toprettyxml() is a very vicious function, because it writes its own newlines and indentation around those whitespace-only text nodes without you being aware of it.
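
To see those text nodes for yourself, here is a small sketch. The inline sample string and the names root and whitespace_nodes are made up for illustration.

```python
from xml.dom.minidom import parseString

# Hypothetical two-level sample with the usual line breaks and indentation.
root = parseString(
    "<Description>\n    <Title>SomeTitle</Title>\n</Description>"
).documentElement

# Each child is either an element or a text node; repr() makes the
# whitespace-only text nodes visible.
for child in root.childNodes:
    if child.nodeType == child.TEXT_NODE:
        print("text   ", repr(child.data))
    else:
        print("element", child.tagName)

# Collect the text nodes that hold nothing but layout whitespace.
whitespace_nodes = [
    c for c in root.childNodes
    if c.nodeType == c.TEXT_NODE and not c.data.strip()
]
```

The whitespace between the tags shows up as text nodes on either side of Title, and those are what end up as the extra lines in the pretty-printed output.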

One of the solutions would be to find a way to erase all the useless textNodes when you parse your file at the beginning (I'm looking for it right now, still haven't found a "pretty" solution).
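
One possible way to do that, as a sketch under assumptions: the helper names strip_whitespace_nodes and parse_clean are hypothetical, and the approach treats every whitespace-only text node as useless, which would be wrong for documents where that whitespace is meaningful.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def strip_whitespace_nodes(parent):
    """Remove whitespace-only text nodes below parent, in place."""
    # Iterate over a copy, because removeChild() mutates childNodes.
    for child in list(parent.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            parent.removeChild(child)
            child.unlink()
        elif child.nodeType == Node.ELEMENT_NODE:
            strip_whitespace_nodes(child)


def parse_clean(xml_text):
    """Parse a string and strip the layout whitespace straight away."""
    doc = parseString(xml_text)
    strip_whitespace_nodes(doc)
    return doc


# Hypothetical input, shaped like the XML in the question.
doc = parse_clean(
    "<Module>\n  <Description>\n    <Title>SomeTitle</Title>\n"
    "  </Description>\n</Module>"
)
pretty = doc.toprettyxml(indent="    ")
print(pretty)
```

For a file on disk, the same helper could be applied to the document returned by xml.dom.minidom.parse() instead of parseString().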

Deleting node by node:

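from xml.dom import Node

def cleanUpNodes(nodes):
    # Blank the whitespace-only text nodes among the direct children,
    # then let normalize() discard the emptied nodes.
    for node in nodes.childNodes:
        if node.nodeType == Node.TEXT_NODE and not node.data.strip():
            node.data = ''
    nodes.normalize()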

from http://mail.python.org/pipermail/xml-sig/2004-March/010191.html

Upvotes: 3
