James Adams

Reputation: 8737

How to output XML from BeautifulSoup without extraneous newlines?

I'm using Python and BeautifulSoup to parse and access elements from an XML document. I modify the values of a couple of the elements and then write the XML back into the file. The trouble is that the updated XML file contains newlines at the start and end of each XML element's text values, resulting in a file that looks like this:

<annotation>
 <folder>
  Definitiva
 </folder>
 <filename>
  armas_229.jpg
 </filename>
 <path>
  /tmp/tmpygedczp5/handgun/images/armas_229.jpg
 </path>
 <size>
  <width>
   1800
  </width>
  <height>
   1426
  </height>
  <depth>
   3
  </depth>
 </size>
 <segmented>
  0
 </segmented>
 <object>
  <name>
   handgun
  </name>
  <pose>
   Unspecified
  </pose>
  <truncated>
   0
  </truncated>
  <difficult>
   0
  </difficult>
  <bndbox>
   <xmin>
    1001
   </xmin>
   <ymin>
    549
   </ymin>
   <xmax>
    1453
   </xmax>
   <ymax>
    1147
   </ymax>
  </bndbox>
 </object>
</annotation>

Instead I'd rather have the output file look like this:

<annotation>
 <folder>Definitiva</folder>
 <filename>armas_229.jpg</filename>
 <path>/tmp/tmpygedczp5/handgun/images/armas_229.jpg</path>
 <size>
  <width>1800</width>
  <height>1426</height>
  <depth>3</depth>
 </size>
 <segmented>0</segmented>
 <object>
  <name>handgun</name>
  <pose>Unspecified</pose>
  <truncated>0</truncated>
  <difficult>0</difficult>
  <bndbox>
   <xmin>1001</xmin>
   <ymin>549</ymin>
   <xmax>1453</xmax>
   <ymax>1147</ymax>
  </bndbox>
 </object>
</annotation>

I open the file and get the "soup" like so:

    from bs4 import BeautifulSoup

    with open(pascal_xml_file_path) as pascal_file:
        pascal_contents = pascal_file.read()
    soup = BeautifulSoup(pascal_contents, "xml")

After modifying a couple of the document's values, I write the document back to the file using BeautifulSoup.prettify like so:

    with open(pascal_xml_file_path, "w") as pascal_file:
        pascal_file.write(soup.prettify())

My assumption is that the BeautifulSoup.prettify is adding these superfluous/gratuitous newlines by default, and there doesn't appear to be a good way to modify this behavior. Have I missed something in the BeautifulSoup documentation, or am I truly unable to modify this behavior and need to use another approach for outputting the XML to file? Maybe I'm just better off rewriting this using xml.etree.ElementTree instead?

Upvotes: 6

Views: 6712

Answers (4)

bikram

Reputation: 7945

I wrote some code that does the prettification without any extra library.

Prettification logic

from bs4 import BeautifulSoup


# Recursive helper (do not call this directly)
def _get_prettified(tag, curr_indent, indent):
    out = ''
    for x in tag.find_all(recursive=False):
        if len(x.find_all()) == 0:
            # Leaf tag: keep its text on one line; x.string is None for an empty tag
            content = (x.string or '').strip(' \n')
        else:
            # Tag with child tags: recurse one indentation level deeper
            content = '\n' + _get_prettified(x, curr_indent + ' ' * indent, indent) + curr_indent

        attrs = ' '.join([f'{k}="{v}"' for k, v in x.attrs.items()])
        out += curr_indent + ('<%s %s>' % (x.name, attrs) if len(attrs) > 0 else '<%s>' % x.name) + content + '</%s>\n' % x.name

    return out


# Call this function
def get_prettified(tag, indent):
    return _get_prettified(tag, '', indent)

Your input

source = """<annotation>
 <folder>
  Definitiva
 </folder>
 <filename>
  armas_229.jpg
 </filename>
 <path>
  /tmp/tmpygedczp5/handgun/images/armas_229.jpg
 </path>
 <size>
  <width>
   1800
  </width>
  <height>
   1426
  </height>
  <depth>
   3
  </depth>
 </size>
 <segmented>
  0
 </segmented>
 <object>
  <name>
   handgun
  </name>
  <pose>
   Unspecified
  </pose>
  <truncated>
   0
  </truncated>
  <difficult>
   0
  </difficult>
  <bndbox>
   <xmin>
    1001
   </xmin>
   <ymin>
    549
   </ymin>
   <xmax>
    1453
   </xmax>
   <ymax>
    1147
   </ymax>
  </bndbox>
 </object>
</annotation>"""

Output

bs = BeautifulSoup(source, 'html.parser')
output = get_prettified(bs, indent=2)
print(output)

# Prints the following:
<annotation>
  <folder>Definitiva</folder>
  <filename>armas_229.jpg</filename>
  <path>/tmp/tmpygedczp5/handgun/images/armas_229.jpg</path>
  <size>
    <width>1800</width>
    <height>1426</height>
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>handgun</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>1001</xmin>
      <ymin>549</ymin>
      <xmax>1453</xmax>
      <ymax>1147</ymax>
    </bndbox>
  </object>
</annotation>

Run your code here: https://replit.com/@bikcrum/BeautifulSoup-Prettifier

Upvotes: 0

James Adams

Reputation: 8737

It turns out to be straightforward to get the output I want if I use xml.etree.ElementTree instead of BeautifulSoup. For example, the code below reads an XML file, strips leading and trailing whitespace (including newlines) from each element's text, and then writes the tree back out as an XML file.

import argparse
from xml.etree import ElementTree


# ------------------------------------------------------------------------------
def reformat(
        input_xml: str,
        output_xml: str,
):
    tree = ElementTree.parse(input_xml)

    # remove extraneous newlines and whitespace from text elements
    for element in tree.iter():  # getiterator() was removed in Python 3.9
        if element.text:
            element.text = element.text.strip()

    # write the updated XML into the annotations output directory
    tree.write(output_xml)


# ------------------------------------------------------------------------------
if __name__ == "__main__":

    # parse the command line arguments
    args_parser = argparse.ArgumentParser()
    args_parser.add_argument(
        "--in",
        required=True,
        type=str,
        help="file path of original XML",
    )
    args_parser.add_argument(
        "--out",
        required=True,
        type=str,
        help="file path of reformatted XML",
    )
    args = vars(args_parser.parse_args())

    reformat(
        args["in"],
        args["out"],
    )
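
Note that stripping element.text alone leaves the whitespace between sibling elements (each element's tail) exactly as it was in the input file. Below is a minimal sketch, assuming Python 3.9 or later where ElementTree.indent() is available, that strips the text and then re-indents the whole tree. The name reformat_string and the sample string are illustrative only and are not part of the script above.

```python
from xml.etree import ElementTree


def reformat_string(xml_text: str, indent: str = " ") -> str:
    """Strip whitespace around element text, then re-indent the tree."""
    root = ElementTree.fromstring(xml_text)

    # Remove leading/trailing whitespace from every element's text.
    for element in root.iter():
        if element.text:
            element.text = element.text.strip()

    # ElementTree.indent() (Python 3.9+) rewrites the whitespace between
    # elements in place; non-blank leaf text is left untouched.
    ElementTree.indent(root, space=indent)
    return ElementTree.tostring(root, encoding="unicode")


sample = "<size>\n  <width>\n   1800\n  </width>\n  <depth>\n   3\n  </depth>\n</size>"
print(reformat_string(sample))
```

The same two steps should work on a tree loaded with ElementTree.parse(), since indent() also accepts an ElementTree instance.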

Upvotes: 0

Bitto

Reputation: 8215

My assumption is that the BeautifulSoup.prettify is adding these superfluous/gratuitous newlines by default, and there doesn't appear to be a good way to modify this behavior.

YES

It does so in two methods of the bs4.Tag class: decode and decode_contents.

Ref: Source file on github

If you just need a temporary fix, you can monkey-patch these two methods.

Here is my implementation:

from bs4 import Tag, NavigableString, BeautifulSoup
from bs4.element import AttributeValueWithCharsetSubstitution, EntitySubstitution


def decode(
    self, indent_level=None,
    eventual_encoding="utf-8", formatter="minimal"
):
    if not callable(formatter):
        formatter = self._formatter_for_name(formatter)

    attrs = []
    if self.attrs:
        for key, val in sorted(self.attrs.items()):
            if val is None:
                decoded = key
            else:
                if isinstance(val, list) or isinstance(val, tuple):
                    val = ' '.join(val)
                elif not isinstance(val, str):
                    val = str(val)
                elif (
                    isinstance(val, AttributeValueWithCharsetSubstitution)
                    and eventual_encoding is not None
                ):
                    val = val.encode(eventual_encoding)

                text = self.format_string(val, formatter)
                decoded = (
                    str(key) + '='
                    + EntitySubstitution.quoted_attribute_value(text))
            attrs.append(decoded)
    close = ''
    closeTag = ''
    prefix = ''
    if self.prefix:
        prefix = self.prefix + ":"

    if self.is_empty_element:
        close = '/'
    else:
        closeTag = '</%s%s>' % (prefix, self.name)

    pretty_print = self._should_pretty_print(indent_level)
    space = ''
    indent_space = ''
    if indent_level is not None:
        indent_space = (' ' * (indent_level - 1))
    if pretty_print:
        space = indent_space
        indent_contents = indent_level + 1
    else:
        indent_contents = None
    contents = self.decode_contents(
        indent_contents, eventual_encoding, formatter)

    if self.hidden:
        # This is the 'document root' object.
        s = contents
    else:
        s = []
        attribute_string = ''
        if attrs:
            attribute_string = ' ' + ' '.join(attrs)
        if indent_level is not None:
            # Even if this particular tag is not pretty-printed,
            # we should indent up to the start of the tag.
            s.append(indent_space)
        s.append('<%s%s%s%s>' % (
                prefix, self.name, attribute_string, close))
        has_tag_child = False
        if pretty_print:
            for item in self.children:
                if isinstance(item, Tag):
                    has_tag_child = True
                    break
            if has_tag_child:
                s.append("\n")
        s.append(contents)
        if not has_tag_child:
            s[-1] = s[-1].strip()
        if pretty_print and contents and contents[-1] != "\n":
            s.append("")
        if pretty_print and closeTag:
            if has_tag_child:
                s.append(space)
        s.append(closeTag)
        if indent_level is not None and closeTag and self.next_sibling:
            # Even if this particular tag is not pretty-printed,
            # we're now done with the tag, and we should add a
            # newline if appropriate.
            s.append("\n")
        s = ''.join(s)
    return s


def decode_contents(
    self,
    indent_level=None,
    eventual_encoding="utf-8",
    formatter="minimal"
):
    # First off, turn a string formatter into a function. This
    # will stop the lookup from happening over and over again.
    if not callable(formatter):
        formatter = self._formatter_for_name(formatter)

    pretty_print = (indent_level is not None)
    s = []
    for c in self:
        text = None
        if isinstance(c, NavigableString):
            text = c.output_ready(formatter)
        elif isinstance(c, Tag):
            s.append(
                c.decode(indent_level, eventual_encoding, formatter)
            )
        if text and indent_level and not self.name == 'pre':
            text = text.strip()
        if text:
            if pretty_print and not self.name == 'pre':
                s.append(" " * (indent_level - 1))
            s.append(text)
            if pretty_print and not self.name == 'pre':
                s.append("")
    return ''.join(s)


Tag.decode = decode
Tag.decode_contents = decode_contents

After this, when I ran print(soup.prettify()), the output was:

<annotation>
 <folder>Definitiva</folder>
 <filename>armas_229.jpg</filename>
 <path>/tmp/tmpygedczp5/handgun/images/armas_229.jpg</path>
 <size>
  <width>1800</width>
  <height>1426</height>
  <depth>3</depth>
 </size>
 <segmented>0</segmented>
 <object>
  <name>handgun</name>
  <pose>Unspecified</pose>
  <truncated>0</truncated>
  <difficult>0</difficult>
  <bndbox>
   <xmin>1001</xmin>
   <ymin>549</ymin>
   <xmax>1453</xmax>
   <ymax>1147</ymax>
  </bndbox>
 </object>
</annotation>

I made a lot of assumptions while doing this. Just wanted to show that it is possible.
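
Because the patch is assigned onto the shared Tag class, it changes the output of every soup in the process for as long as it stays in place. One way to keep such a patch scoped is a small context manager that restores the original methods on exit. The sketch below uses a hypothetical stand-in class (Node) instead of bs4 so that it runs on its own; patched, Node and compact_decode are made-up names, not bs4 APIs.

```python
from contextlib import contextmanager


@contextmanager
def patched(cls, **replacements):
    """Temporarily replace attributes on cls, restoring them on exit."""
    originals = {name: getattr(cls, name) for name in replacements}
    for name, func in replacements.items():
        setattr(cls, name, func)
    try:
        yield
    finally:
        # Put the original attributes back, even if the body raised.
        for name, func in originals.items():
            setattr(cls, name, func)


class Node:
    """Hypothetical stand-in for bs4.Tag; not part of any library."""

    def __init__(self, text):
        self.text = text

    def decode(self):
        # Mimics the unwanted behavior: text wrapped in newlines.
        return "\n " + self.text + "\n"


def compact_decode(self):
    # Replacement method: return the text without surrounding whitespace.
    return self.text.strip()


node = Node("1800")
with patched(Node, decode=compact_decode):
    inside = node.decode()   # uses the replacement
outside = node.decode()      # original method is back
print(repr(inside), repr(outside))
```

Applied to the code above, the call would presumably look like `with patched(Tag, decode=decode, decode_contents=decode_contents):` wrapped around the prettify() call.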

Upvotes: 2

Parfait

Reputation: 107587

Consider XSLT with Python's third-party module, lxml (which you possibly already have with BeautifulSoup integration). Specifically, use the identity transform to copy the XML as is, then apply the normalize-space() function to every text node.
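
For readers unfamiliar with normalize-space(): it trims leading and trailing whitespace and collapses each internal run of whitespace into a single space. A rough pure-Python approximation is sketched below. The function name normalize_space is made up for this illustration and is not part of lxml; the XSLT processor performs the real operation.

```python
import re

# XML whitespace is space, tab, carriage return and line feed.
_XML_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(text: str) -> str:
    """Approximate XPath normalize-space() for a single string."""
    # Collapse every whitespace run to one space, then trim both ends.
    return _XML_WS.sub(" ", text).strip(" ")


print(repr(normalize_space("\n   armas_229.jpg\n  ")))   # 'armas_229.jpg'
print(repr(normalize_space("  two   words\n here ")))    # 'two words here'
```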

XSLT (save as an .xsl file, which is itself an XML file, or embed it as a string)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="yes"/>
    <xsl:strip-space elements="*"/>

    <!-- IDENTITY TRANSFORM -->
    <xsl:template match="@*|node()">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:template>

    <!-- RUN normalize-space() ON ALL TEXT NODES -->
    <xsl:template match="text()">
        <xsl:copy-of select="normalize-space()"/>
    </xsl:template>            
</xsl:stylesheet>

Python

import lxml.etree as et

# LOAD FROM STRING OR PARSE FROM FILE
str_xml = '''...'''    
str_xsl = '''...'''

doc = et.fromstring(str_xml)
style = et.fromstring(str_xsl)

# INITIALIZE TRANSFORMER AND RUN 
transformer = et.XSLT(style)
result = transformer(doc)

# PRINT TO SCREEN
print(result)

# SAVE TO DISK
with open('Output.xml', 'wb') as f:
    f.write(result)

Rextester demo

Upvotes: 1
