DevC
DevC

Reputation: 7423

Python how to strip white-spaces from xml text nodes

I have a xml file as follows

<Person>
<name>

 My Name

</name>
<Address>My Address</Address>
</Person>

The tag has extra new lines, Is there any quick Pythonic way to trim this and generate a new xml.

I found this but it trims only which are between tags not the value https://skyl.org/log/post/skyl/2010/04/remove-insignificant-whitespace-from-xml-string-with-python/

Update 1 - Handle following xml which has tail spaces in <name> tag

<Person>
<name>

 My Name<shortname>My</short>

</name>
<Address>My Address</Address>
</Person>

Accepted answer handle above both kind of xml's

Update 2 - I have posted my version in answer below, I am using it to remove all kind of whitespaces and generate pretty xml in file with xml encodings

https://stackoverflow.com/a/19396130/973699

Upvotes: 5

Views: 12642

Answers (5)

user7851115
user7851115

Reputation:

I'm working with an older version of Python (2.3), and I'm currently stuck with the standard library. To show an answer that's greatly backwards compatible, I've written this with xml.dom and xml.minidom functions.

import codecs
from xml.dom import minidom

# Read in the file to a DOM data structure.
original_document = minidom.parse("original_document.xml")

# Open a UTF-8 encoded file, because it's fairly standard for XML.
stripped_file = codecs.open("stripped_document.xml", "w", encoding="utf8")

# Tell minidom to format the child text nodes without any extra whitespace.
original_document.writexml(stripped_file, indent="", addindent="", newl="")

stripped_file.close()

While it's not BeautifulSoup, this solution is pretty elegant and uses the full force of the lower-level API. Note that the actual formatting is just one line :)

Documentation of API calls used here:

Upvotes: 1

DevC
DevC

Reputation: 7423

Accepted answer given by Birei using lxml does the job perfectly, but I wanted to trim all kind of white/blank space, blank lines and regenerate pretty xml in a xml file.

Following code did what I wanted

from lxml import etree

#discard strings which are entirely white spaces
myparser = etree.XMLParser(remove_blank_text=True)

root = etree.parse('xmlfile',myparser)

#from Birei's answer 
for elem in root.iter('*'):
    if elem.text is not None:
        elem.text = elem.text.strip()
    if elem.tail is not None:
        elem.tail = elem.tail.strip()

#write the xml file with pretty print and xml encoding
root.write('xmlfile', pretty_print=True, encoding="utf-8", xml_declaration=True)

Upvotes: 3

Birei
Birei

Reputation: 36262

With lxml you can iterate over all elements and check if it has text to strip():

from lxml import etree

tree = etree.parse('xmlfile')
root = tree.getroot()

for elem in root.iter('*'):
    if elem.text is not None:
        elem.text = elem.text.strip()

print(etree.tostring(root))

It yields:

<Person><name>My Name</name>
<Address>My Address</Address>
</Person>

UPDATE to strip tail text too:

from lxml import etree

tree = etree.parse('xmlfile')
root = tree.getroot()

for elem in root.iter('*'):
    if elem.text is not None:
        elem.text = elem.text.strip()
    if elem.tail is not None:
        elem.tail = elem.tail.strip()

print(etree.tostring(root, encoding="utf-8", xml_declaration=True))

Upvotes: 8

Birei
Birei

Reputation: 36262

You can use . Do traverse all elements and for each one that contains some text, replace it with its stripped version:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('xmlfile', 'r'), 'xml')

for elem in soup.find_all():
    if elem.string is not None:
        elem.string = elem.string.strip()

print(soup)

Assuming xmlfile with the content provided in the question, it yields:

<?xml version="1.0" encoding="utf-8"?>
<Person>
<name>My Name</name>
<Address>My Address</Address>
</Person>

Upvotes: 1

Basel Shishani
Basel Shishani

Reputation: 8177

You have to do xml parsing for this one way or another, so maybe use xml.sax and copy to the output stream at each event (skipping ignorableWhitespace), and add tag markers as needed. Check the sample code here http://www.knowthytools.com/2010/03/sax-parsing-with-python.html.

Upvotes: 2

Related Questions