TMarks

Reputation: 564

BeautifulSoup4 not accepting valid XML

I am trying to open a valid xml file, parse it with lxml-xml, prettify it, and finally save it to a different file.

My code is as follows:

from bs4 import BeautifulSoup

def main(path_to_config):
    with open(f'configs/{path_to_config}', 'r') as file:
        contents = file.read()
        soup = BeautifulSoup(contents, 'xml')
        with open(f'pretty_xml/{path_to_config.split("_")[0]}.xml', 'w') as new_file:
            new_file.write(soup.prettify())

Unfortunately, no matter what is in the file, the parser does not produce valid XML. The single line <?xml version="1.0" encoding="utf-8"?> is all that is saved to the files in pretty_xml/. I have confirmed with multiple online validators that the XML I am passing in is valid.

I have tried passing the file object itself instead of file.read(), with no luck. I have also tried passing a hard-coded string of XML instead, which works. That suggests the parser itself is fine and that something is breaking between opening the file and passing contents to BeautifulSoup.

Any help with this would be very much appreciated.

UPDATE:

My xml file has a single line, <note><time>twelve</time></note>.

As a sanity check, I added assert contents == '<note><time>twelve</time></note>', since the parser has no problem when I pass that string to BeautifulSoup directly. The new line raised an AssertionError, and I do not understand why. Should the strings not be identical? I copied the string from the .py file straight into the .xml file, and there is no additional whitespace or any other character.

Upvotes: 1

Views: 251

Answers (1)

TMarks

Reputation: 564

There was a BOM (byte order mark) at the beginning of my file, which was not overwritten when I copied and pasted from the .py file into the .xml file.

I discovered this thanks to @snakecharmerb's suggestion to use repr(contents) to view the true representation of my string, which showed that the value was '\'\\ufeff<note><time>twelve</time></note>\''. The \ufeff is a BOM and needed to be removed.
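For anyone who wants to reproduce the diagnosis, here is a minimal sketch. It assumes the file was saved as UTF-8 with a BOM and decoded as plain utf-8; the bytes are built in memory rather than read from my actual file, and the variable names are only illustrative.

```python
import codecs

expected = '<note><time>twelve</time></note>'

# Simulate the bytes of a file that was saved as UTF-8 with a BOM.
raw = codecs.BOM_UTF8 + expected.encode('utf-8')

# Decoding as plain utf-8 keeps the BOM as the character U+FEFF.
contents = raw.decode('utf-8')

# repr() escapes the otherwise invisible character, so it becomes visible.
print(repr(contents))

# This is why the assert failed: the strings differ by one leading character.
print(contents == expected)

# Stripping the leading U+FEFF makes the strings identical again.
print(contents.lstrip('\ufeff') == expected)
```

Printing the string directly would hide the BOM, which is why comparing by eye in an editor did not reveal it.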

I added the following lines to the beginning of my function, and they fix the error.

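with open(f'configs/{path_to_config}', mode='r', encoding='utf-8-sig') as f:
    s = f.read()  # utf-8-sig drops a leading BOM if one is present
with open(f'configs/{path_to_config}', mode='w', encoding='utf-8') as f:
    f.write(s)  # rewrite the file without the BOM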
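The lines above rewrite the config file in place. If you would rather leave the file untouched, a possible alternative is to decode with utf-8-sig at the point where the file is read. The sketch below assumes the files are UTF-8; read_text_without_bom is a hypothetical helper name, and the temporary files only stand in for the real configs.

```python
import codecs
import os
import tempfile


def read_text_without_bom(path):
    """Hypothetical helper: read a UTF-8 file, dropping a leading BOM if present."""
    with open(path, mode='r', encoding='utf-8-sig') as f:
        return f.read()


xml = '<note><time>twelve</time></note>'

with tempfile.TemporaryDirectory() as tmp:
    with_bom = os.path.join(tmp, 'with_bom.xml')
    without_bom = os.path.join(tmp, 'without_bom.xml')

    # One file starts with a UTF-8 BOM, the other does not.
    with open(with_bom, 'wb') as f:
        f.write(codecs.BOM_UTF8 + xml.encode('utf-8'))
    with open(without_bom, 'wb') as f:
        f.write(xml.encode('utf-8'))

    from_bom_file = read_text_without_bom(with_bom)
    from_plain_file = read_text_without_bom(without_bom)

# Both reads give the same text, with or without a BOM in the file.
print(from_bom_file == xml, from_plain_file == xml)
```

The returned string can then be passed to BeautifulSoup(contents, 'xml') as in the question. Since utf-8-sig also reads files that have no BOM, the same call works for both kinds of file.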

Upvotes: 2
