Reputation: 21
Please bear with me as I'm VERY new to Python (and the greater programming community), but I'm being guided by a coworker with more experience than I have. We're trying to write a Python script that reads in an XML file, picks apart certain parts of the data, edits some of the variable values and then reassembles the XML. The problem we're running into is the way the data is formatted when it is written back out to the new document using toprettyxml().
Basically, the top half of the file has a bunch of elements that we don't need to modify at all, so we're trying to just grab those elements in their entirety, then append them back on to the root when we reassemble. Some of the lower elements on the same page at the same level are being picked apart into smaller items in memory and being reassembled at the lowest child levels. The ones that are being manually reassembled and appended are working fine.
So here's what should roughly be the relevant bits of code:
def __handleElemsWithAtrributes(elem):
    #returns empty element with all attributes of source element
    tmpDoc = Document()
    result = tmpDoc.createElement(elem.item(0).tagName)
    attr_map = elem.item(0).attributes
    for i in range(attr_map.length):
        result.setAttribute(attr_map.item(i).name, attr_map.item(i).value)
    return result
def __getWholeElement(elems):
    #returns element with all attributes of source element and all contents
    if len(elems) == 0:
        return 0
    temp = Document()
    for e in elems:
        result = temp.createElement(e.tagName)
        attr_map = e.attributes
        for i in range(attr_map.length):
            result.setAttribute(attr_map.item(i).name, attr_map.item(i).value)
        result = e
    return result
def __init__():
    ##A bunch of other stuff I'm leaving out...
    f = xml.dom.minidom.parse(pathToFile)
    doc = Document()
    modules = f.getElementsByTagName("Module")
    descriptions = f.getElementsByTagName("Description")
    steptree = f.getElementsByTagName("StepTree")
    reference = f.getElementsByTagName("LessonReference")
    mod_val = __handleElemsWithAtrributes(modules)
    des_val = __getWholeElement(descriptions)
    step_val = __getWholeElement(steptree)
    ref_val = __getWholeElement(reference)
    if des_val != 0 and mod_val != 0 and step_val != 0 and ref_val != 0:
        mod_val.appendChild(des_val)
        mod_val.appendChild(step_val)
        mod_val.appendChild(ref_val)
        doc.appendChild(mod_val)
        o.write(doc.toprettyxml())
No, the tabbing is not accurately preserved here because I copied from several different areas, but I'm sure you get the gist.
Basically, the input I am using looks something like this:
<Module aatribute="" attribte2="" attribute3="" >
<Description>
<Title>SomeTitle</Title>
<Objective>An objective</Objective>
<Action>
<Familiarize>familiarize text</Familiarize>
</Action>
<Condition>
<Familiarize>Condition text</Familiarize>
</Condition>
<Standard>
<Familiarize>Standard text</Familiarize>
</Standard>
<PerformanceMeasures>
<Measure>COL text</Measure>
</PerformanceMeasures>
<TMReferences>
<Reference>Reference text</Reference>
</TMReferences>
</Description>
And then when it's reassembled, it comes out looking something like this:
<Module aatribute="" attribte2="" attribute3="" >
<Description>
<Title>SomeTitle</Title>
<Objective>An objective</Objective>
<Action>
<Familiarize>familiarize text</Familiarize>
</Action>
<Condition>
<Familiarize>Condition text</Familiarize>
</Condition>
<Standard>
<Familiarize>Standard text</Familiarize>
</Standard>
<PerformanceMeasures>
<Measure>COL text</Measure>
</PerformanceMeasures>
<TMReferences>
<Reference>Reference text</Reference>
</TMReferences>
</Description>
How do I get it to stop making all of the extra empty lines? Any ideas?
Upvotes: 2
Views: 2368
Reputation: 121
Thank you, this works recursively!
def cleanUpNodes(self, nodes):
    for node in nodes.childNodes:
        if node.nodeType == node.TEXT_NODE and (node.data.startswith('\t') or node.data.startswith('\n') or node.data.startswith('\r')):
            node.data = ''
        if node.nodeType == node.ELEMENT_NODE:
            self.cleanUpNodes(node)
    nodes.normalize()
Upvotes: -1
Reputation: 234
I have the same issue.
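The thing is, every line break and run of indentation between elements in your source file is kept as a text node in your tree when the file is parsed.
Hence, toprettyxml() is a very vicious function: it adds its own indentation and newlines around those whitespace-only text nodes without you being aware of it, and that is where the empty lines come from.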
One of the solutions would be to find a way to erase all the useless textNodes when you parse your file at the beginning (I'm looking for it right now, still haven't found a "pretty" solution).
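To see those whitespace text nodes for yourself, here is a minimal sketch. The SAMPLE string is a made-up stand-in for the real file (an assumption, not the asker's data), and the behaviour described is what I'd expect from a current Python 3 minidom:

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# Hypothetical sample input, indented the way a hand-edited file usually is.
SAMPLE = """<Module>
    <Description>
        <Title>SomeTitle</Title>
    </Description>
</Module>"""

doc = parseString(SAMPLE)
root = doc.documentElement

# The parser keeps the line breaks and indentation around <Description>
# as text nodes that contain nothing but whitespace.
whitespace_nodes = [
    child for child in root.childNodes
    if child.nodeType == Node.TEXT_NODE and not child.data.strip()
]
print(len(root.childNodes), "children,", len(whitespace_nodes), "are whitespace only")

# toprettyxml() writes its own indent and newline around each of those
# nodes, which shows up as whitespace-only lines in the output.
pretty = doc.toprettyxml(indent="  ")
print(pretty)
```

So the empty lines are already sitting in the tree as text nodes before toprettyxml() is ever called; pretty-printing just makes them visible.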
Deleting node by node:
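from xml.dom import Node

def cleanUpNodes(nodes):
    # Blank every text node directly under `nodes`; normalize() then drops the empty ones.
    # Note: this also blanks text nodes that hold real content, not just whitespace.
    for node in nodes.childNodes:
        if node.nodeType == Node.TEXT_NODE:
            node.data = ''
    nodes.normalize()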
from http://mail.python.org/pipermail/xml-sig/2004-March/010191.html
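A variant that might be safer, as a sketch only: walk the tree recursively and remove only the text nodes that are pure whitespace, so real text such as "SomeTitle" survives. The function name strip_whitespace_nodes and the SAMPLE string are my own inventions for illustration, and this assumes whitespace between elements is not meaningful in your files.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

def strip_whitespace_nodes(node):
    """Recursively remove text nodes that contain only whitespace."""
    # Iterate over a copy, because removeChild() mutates node.childNodes.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
            child.unlink()
        elif child.nodeType == Node.ELEMENT_NODE:
            strip_whitespace_nodes(child)

# Hypothetical sample input standing in for the real file.
SAMPLE = """<Module>
    <Description>
        <Title>SomeTitle</Title>
    </Description>
</Module>"""

doc = parseString(SAMPLE)
strip_whitespace_nodes(doc.documentElement)

# With the whitespace-only nodes gone, toprettyxml() only emits
# the indentation and newlines it generates itself.
pretty = doc.toprettyxml(indent="  ")
print(pretty)
```

Call it once on the parsed document right after xml.dom.minidom.parse(), before you start picking elements apart, and the reassembled output should no longer carry the extra empty lines.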
Upvotes: 3