Reputation: 564
I am trying to open a valid XML file, parse it with lxml-xml, prettify it, and finally save it to a different file.
My code is as follows:
from bs4 import BeautifulSoup

def main(path_to_config):
    with open(f'configs/{path_to_config}', 'r') as file:
        contents = file.read()
    soup = BeautifulSoup(contents, 'xml')
    with open(f'pretty_xml/{path_to_config.split("_")[0]}.xml', 'w') as new_file:
        new_file.write(soup.prettify())
Unfortunately, no matter what is in the input file, the parser does not produce valid XML. The single line <?xml version="1.0" encoding="utf-8"?> is all that is saved to the files in pretty_xml/. I have confirmed with multiple online validators that the XML I am passing in is valid.
I have tried passing the file object itself instead of file.read(), with no luck. I have also tried passing a literal XML string, which works. That confirms the parser itself is fine and that something breaks between opening the file and passing contents to BeautifulSoup.
Any help with this would be very much appreciated.
UPDATE:
My XML file contains a single line, <note><time>twelve</time></note>.
As a sanity check, I added assert contents == '<note><time>twelve</time></note>', since the parser has no problem when I pass that string to BeautifulSoup directly. The new line raised an AssertionError, and I do not understand why. Should the strings not be identical? I copied the string straight from the .py file into the .xml file; there is no extra whitespace or any other character.
Upvotes: 1
Views: 251
Reputation: 564
There was a BOM (byte order mark) at the beginning of my file, and it was not overwritten when I copy-pasted from the .py file into the .xml file.
I discovered this thanks to @snakecharmerb's suggestion to use repr(contents) to view the true representation of my string, which showed that the value was '\'\\ufeff<note><time>twelve</time></note>\''. The \ufeff is a BOM and needed to be removed.
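To illustrate why the assert failed, here is a minimal sketch that simulates the file's bytes in memory rather than reading my actual file. The variable names (raw, with_bom, without_bom) are made up for this example. It shows that decoding as plain utf-8 keeps the BOM as an invisible \ufeff character, while utf-8-sig drops it.

```python
import codecs

xml = '<note><time>twelve</time></note>'

# Simulate the bytes of a file saved with a UTF-8 BOM in front of the XML.
raw = codecs.BOM_UTF8 + xml.encode('utf-8')

with_bom = raw.decode('utf-8')         # BOM survives as '\ufeff'
without_bom = raw.decode('utf-8-sig')  # BOM is stripped while decoding

print(repr(with_bom))      # repr() makes the hidden character visible
print(with_bom == xml)     # False: this is why the assert failed
print(without_bom == xml)  # True
```

Printing the string normally hides the BOM, which is why the two strings looked identical until repr() was used.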
I added the following lines to the beginning of my function, and they fix the error by re-saving the file without the BOM.
with open(f'configs/{path_to_config}', mode='r', encoding='utf-8-sig') as file:
    s = file.read()
with open(f'configs/{path_to_config}', mode='w', encoding='utf-8') as file:
    file.write(s)
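A possible alternative, if rewriting the config file in place is not desirable, is to strip the BOM only while reading. This is a sketch, not what I actually ran: read_text_without_bom is a hypothetical helper name, and the demo uses a temporary file instead of my configs/ directory.

```python
import codecs
import os
import tempfile


def read_text_without_bom(path):
    """Read a UTF-8 text file, dropping a leading BOM if one is present."""
    # 'utf-8-sig' also reads files that have no BOM, so it is safe either way.
    with open(path, mode='r', encoding='utf-8-sig') as file:
        return file.read()


# Demo: create a file that starts with a BOM, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.xml')
    with open(demo_path, 'wb') as f:
        f.write(codecs.BOM_UTF8 + b'<note><time>twelve</time></note>')
    contents = read_text_without_bom(demo_path)

print(contents == '<note><time>twelve</time></note>')  # True
```

The returned string could then be passed to BeautifulSoup(contents, 'xml') as in the original function, leaving the source file untouched.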
Upvotes: 2