Reputation: 8737
I'm using Python and BeautifulSoup to parse and access elements from an XML document. I modify the values of a couple of the elements and then write the XML back into the file. The trouble is that the updated XML file contains newlines at the start and end of each XML element's text values, resulting in a file that looks like this:
<annotation>
 <folder>
  Definitiva
 </folder>
 <filename>
  armas_229.jpg
 </filename>
 <path>
  /tmp/tmpygedczp5/handgun/images/armas_229.jpg
 </path>
 <size>
  <width>
   1800
  </width>
  <height>
   1426
  </height>
  <depth>
   3
  </depth>
 </size>
 <segmented>
  0
 </segmented>
 <object>
  <name>
   handgun
  </name>
  <pose>
   Unspecified
  </pose>
  <truncated>
   0
  </truncated>
  <difficult>
   0
  </difficult>
  <bndbox>
   <xmin>
    1001
   </xmin>
   <ymin>
    549
   </ymin>
   <xmax>
    1453
   </xmax>
   <ymax>
    1147
   </ymax>
  </bndbox>
 </object>
</annotation>
Instead I'd rather have the output file look like this:
<annotation>
    <folder>Definitiva</folder>
    <filename>armas_229.jpg</filename>
    <path>/tmp/tmpygedczp5/handgun/images/armas_229.jpg</path>
    <size>
        <width>1800</width>
        <height>1426</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>handgun</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>1001</xmin>
            <ymin>549</ymin>
            <xmax>1453</xmax>
            <ymax>1147</ymax>
        </bndbox>
    </object>
</annotation>
I open the file and get the "soup" like so:
from bs4 import BeautifulSoup

with open(pascal_xml_file_path) as pascal_file:
    pascal_contents = pascal_file.read()
soup = BeautifulSoup(pascal_contents, "xml")
After I've finished modifying a couple of the document's values, I write the document back to the file using BeautifulSoup.prettify like so:
with open(pascal_xml_file_path, "w") as pascal_file:
    pascal_file.write(soup.prettify())
My assumption is that the BeautifulSoup.prettify is adding these superfluous/gratuitous newlines by default, and there doesn't appear to be a good way to modify this behavior. Have I missed something in the BeautifulSoup documentation, or am I truly unable to modify this behavior and need to use another approach for outputting the XML to file? Maybe I'm just better off rewriting this using xml.etree.ElementTree instead?
Upvotes: 6
Views: 6712
Reputation: 7945
Prettification logic
# Recursive helper (do not call this function directly)
def _get_prettified(tag, curr_indent, indent):
    out = ''
    for x in tag.find_all(recursive=False):
        if len(x.find_all()) == 0:
            # Leaf element: keep its text on one line, stripped of surrounding
            # whitespace (x.string is None for an empty element)
            content = (x.string or '').strip(' \n')
        else:
            # Element with children: recurse one indentation level deeper
            content = '\n' + _get_prettified(x, curr_indent + ' ' * indent, indent) + curr_indent
        attrs = ' '.join([f'{k}="{v}"' for k, v in x.attrs.items()])
        out += curr_indent + ('<%s %s>' % (x.name, attrs) if len(attrs) > 0 else '<%s>' % x.name) + content + '</%s>\n' % x.name
    return out

# Call this function
def get_prettified(tag, indent):
    return _get_prettified(tag, '', indent)
Your input
source = """<annotation>
<folder>
Definitiva
</folder>
<filename>
armas_229.jpg
</filename>
<path>
/tmp/tmpygedczp5/handgun/images/armas_229.jpg
</path>
<size>
<width>
1800
</width>
<height>
1426
</height>
<depth>
3
</depth>
</size>
<segmented>
0
</segmented>
<object>
<name>
handgun
</name>
<pose>
Unspecified
</pose>
<truncated>
0
</truncated>
<difficult>
0
</difficult>
<bndbox>
<xmin>
1001
</xmin>
<ymin>
549
</ymin>
<xmax>
1453
</xmax>
<ymax>
1147
</ymax>
</bndbox>
</object>
</annotation>"""
Output
bs = BeautifulSoup(source, 'html.parser')
output = get_prettified(bs, indent=2)
print(output)
# Prints following
<annotation>
  <folder>Definitiva</folder>
  <filename>armas_229.jpg</filename>
  <path>/tmp/tmpygedczp5/handgun/images/armas_229.jpg</path>
  <size>
    <width>1800</width>
    <height>1426</height>
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>handgun</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>1001</xmin>
      <ymin>549</ymin>
      <xmax>1453</xmax>
      <ymax>1147</ymax>
    </bndbox>
  </object>
</annotation>
Run your code here: https://replit.com/@bikcrum/BeautifulSoup-Prettifier
Upvotes: 0
Reputation: 8737
It turns out to be straightforward to get the indentation I want if I use xml.etree.ElementTree instead of BeautifulSoup. For example, below is some code that reads an XML file, strips any newlines/whitespace from each element's text, and then writes the tree out as an XML file.
import argparse
from xml.etree import ElementTree


# ------------------------------------------------------------------------------
def reformat(
    input_xml: str,
    output_xml: str,
):
    tree = ElementTree.parse(input_xml)

    # remove extraneous newlines and whitespace from text elements
    # (iter() replaces getiterator(), which was removed in Python 3.9)
    for element in tree.iter():
        if element.text:
            element.text = element.text.strip()

    # write the updated XML to the output file
    tree.write(output_xml)


# ------------------------------------------------------------------------------
if __name__ == "__main__":

    # parse the command line arguments
    args_parser = argparse.ArgumentParser()
    args_parser.add_argument(
        "--in",
        required=True,
        type=str,
        help="file path of original XML",
    )
    args_parser.add_argument(
        "--out",
        required=True,
        type=str,
        help="file path of reformatted XML",
    )
    args = vars(args_parser.parse_args())

    reformat(
        args["in"],
        args["out"],
    )
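If the nested indentation matters as well, one possible extension is to call ElementTree.indent after stripping the text. This is only a sketch: it assumes Python 3.9 or newer (where indent was added), and the reformat_string name and its four-space default are made up for illustration. It works on strings rather than files so it is easy to try out.

```python
import xml.etree.ElementTree as ET


def reformat_string(xml_text: str, indent: str = "    ") -> str:
    """Strip whitespace around element text, then re-indent the tree.

    Hypothetical helper; requires Python 3.9+ for ET.indent().
    """
    root = ET.fromstring(xml_text)

    # Remove the newlines/whitespace surrounding each element's text.
    for element in root.iter():
        if element.text:
            element.text = element.text.strip()

    # Rewrite the whitespace between elements so children are indented.
    ET.indent(root, space=indent)

    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    messy = "<size>\n<width>\n1800\n</width>\n<depth>\n3\n</depth>\n</size>"
    print(reformat_string(messy))
```

Elements that only contain text are left on a single line, because indent only touches the whitespace between child elements.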
Upvotes: 0
Reputation: 8215
My assumption is that the BeautifulSoup.prettify is adding these superfluous/gratuitous newlines by default, and there doesn't appear to be a good way to modify this behavior.
YES
It is doing so in two methods of the bs4.Tag class: decode and decode_contents.
If you just need a temporary fix, you can monkey patch these two methods.
Here is my implementation:
from bs4 import Tag, NavigableString, BeautifulSoup
from bs4.element import AttributeValueWithCharsetSubstitution, EntitySubstitution


def decode(
    self, indent_level=None,
    eventual_encoding="utf-8", formatter="minimal"
):
    if not callable(formatter):
        formatter = self._formatter_for_name(formatter)

    attrs = []
    if self.attrs:
        for key, val in sorted(self.attrs.items()):
            if val is None:
                decoded = key
            else:
                if isinstance(val, list) or isinstance(val, tuple):
                    val = ' '.join(val)
                elif not isinstance(val, str):
                    val = str(val)
                elif (
                    isinstance(val, AttributeValueWithCharsetSubstitution)
                    and eventual_encoding is not None
                ):
                    val = val.encode(eventual_encoding)

                text = self.format_string(val, formatter)
                decoded = (
                    str(key) + '='
                    + EntitySubstitution.quoted_attribute_value(text))
            attrs.append(decoded)
    close = ''
    closeTag = ''

    prefix = ''
    if self.prefix:
        prefix = self.prefix + ":"

    if self.is_empty_element:
        close = '/'
    else:
        closeTag = '</%s%s>' % (prefix, self.name)

    pretty_print = self._should_pretty_print(indent_level)
    space = ''
    indent_space = ''
    if indent_level is not None:
        indent_space = (' ' * (indent_level - 1))
    if pretty_print:
        space = indent_space
        indent_contents = indent_level + 1
    else:
        indent_contents = None
    contents = self.decode_contents(
        indent_contents, eventual_encoding, formatter)

    if self.hidden:
        # This is the 'document root' object.
        s = contents
    else:
        s = []
        attribute_string = ''
        if attrs:
            attribute_string = ' ' + ' '.join(attrs)
        if indent_level is not None:
            # Even if this particular tag is not pretty-printed,
            # we should indent up to the start of the tag.
            s.append(indent_space)
        s.append('<%s%s%s%s>' % (
            prefix, self.name, attribute_string, close))
        has_tag_child = False
        if pretty_print:
            for item in self.children:
                if isinstance(item, Tag):
                    has_tag_child = True
                    break
            if has_tag_child:
                s.append("\n")
        s.append(contents)
        if not has_tag_child:
            s[-1] = s[-1].strip()
        if pretty_print and contents and contents[-1] != "\n":
            s.append("")
        if pretty_print and closeTag:
            if has_tag_child:
                s.append(space)
        s.append(closeTag)
        if indent_level is not None and closeTag and self.next_sibling:
            # Even if this particular tag is not pretty-printed,
            # we're now done with the tag, and we should add a
            # newline if appropriate.
            s.append("\n")
        s = ''.join(s)
    return s


def decode_contents(
    self,
    indent_level=None,
    eventual_encoding="utf-8",
    formatter="minimal"
):
    # First off, turn a string formatter into a function. This
    # will stop the lookup from happening over and over again.
    if not callable(formatter):
        formatter = self._formatter_for_name(formatter)

    pretty_print = (indent_level is not None)
    s = []
    for c in self:
        text = None
        if isinstance(c, NavigableString):
            text = c.output_ready(formatter)
        elif isinstance(c, Tag):
            s.append(
                c.decode(indent_level, eventual_encoding, formatter)
            )
        if text and indent_level and not self.name == 'pre':
            text = text.strip()
        if text:
            if pretty_print and not self.name == 'pre':
                s.append(" " * (indent_level - 1))
            s.append(text)
            if pretty_print and not self.name == 'pre':
                s.append("")
    return ''.join(s)


Tag.decode = decode
Tag.decode_contents = decode_contents
After this, when I did print(soup.prettify()), the output was
<annotation>
 <folder>Definitiva</folder>
 <filename>armas_229.jpg</filename>
 <path>/tmp/tmpygedczp5/handgun/images/armas_229.jpg</path>
 <size>
  <width>1800</width>
  <height>1426</height>
  <depth>3</depth>
 </size>
 <segmented>0</segmented>
 <object>
  <name>handgun</name>
  <pose>Unspecified</pose>
  <truncated>0</truncated>
  <difficult>0</difficult>
  <bndbox>
   <xmin>1001</xmin>
   <ymin>549</ymin>
   <xmax>1453</xmax>
   <ymax>1147</ymax>
  </bndbox>
 </object>
</annotation>
I made a lot of assumptions while doing this. Just wanted to show that it is possible.
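If patching library internals feels too fragile, a different and less invasive technique is to leave prettify alone and post-process the string it returns. The sketch below is not part of the patch above: collapse_text_elements is a hypothetical helper, and it assumes every element holds either a single line of text or child elements, never both. It is run here on a hand-written sample in the same shape as prettify output, so it needs only the standard library.

```python
import re

# Matches ">", a newline, one line of text, a newline, then "</":
# the shape prettify() gives to an element that contains only text.
_TEXT_ONLY = re.compile(r">\n[ \t]*([^<>\n]+)\n[ \t]*</")


def collapse_text_elements(pretty_xml: str) -> str:
    """Put text-only elements back on a single line (hypothetical helper)."""
    return _TEXT_ONLY.sub(
        lambda m: ">" + m.group(1).strip() + "</",
        pretty_xml,
    )


if __name__ == "__main__":
    sample = "<size>\n <width>\n  1800\n </width>\n <depth>\n  3\n </depth>\n</size>"
    print(collapse_text_elements(sample))
```

With a real document this would be called as collapse_text_elements(soup.prettify()). Mixed content (text next to child tags) and multi-line text do not satisfy the stated assumption, so test on your own files before relying on it.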
Upvotes: 2
Reputation: 107587
Consider XSLT with Python's third-party module, lxml (which you possibly already have installed for its BeautifulSoup integration). Specifically, use the identity transform to copy the XML as is, and add a template that runs normalize-space() on all text nodes.
XSLT (save as .xsl, a special .xml file or embedded string)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="yes"/>
    <xsl:strip-space elements="*"/>

    <!-- IDENTITY TRANSFORM -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- RUN normalize-space() ON ALL TEXT NODES -->
    <xsl:template match="text()">
        <xsl:copy-of select="normalize-space()"/>
    </xsl:template>
</xsl:stylesheet>
Python
import lxml.etree as et
# LOAD FROM STRING OR PARSE FROM FILE
str_xml = '''...'''
str_xsl = '''...'''
doc = et.fromstring(str_xml)
style = et.fromstring(str_xsl)
# INITIALIZE TRANSFORMER AND RUN
transformer = et.XSLT(style)
result = transformer(doc)
# PRINT TO SCREEN
print(result)
# SAVE TO DISK
with open('Output.xml', 'wb') as f:
    f.write(result)
Upvotes: 1