Martin J.H.

Reputation: 2205

Cannot parse ISO-8859-15 encoded XML with bs4

I have the following XML document, saved with Notepad++ in ISO-8859-15 encoding:

<?xml version="1.0" encoding="ISO-8859-15"?>
<someTag>
</someTag>

I try to parse this file using bs4, but somehow (even when specifying the encoding everywhere I can think of), I get an empty result:

from bs4 import BeautifulSoup

filepath = 'iso-8859-15_example.xml'
with open(filepath, encoding="iso-8859-15") as f:
    soup = BeautifulSoup(f, 'xml', from_encoding="iso-8859-15")
print(soup)
# --> "<?xml version="1.0" encoding="utf-8"?>", otherwise empty

Removing the encoding hints in the Python code does not help. Strangely, what does work is deleting the first line of the XML file, which is the <?xml ... ?> statement (called the "prolog", I think).

What am I doing wrong here? I thought the prolog would help bs4 to "do the right thing" and choose the correct encoding. Is there an alternative to deleting the prolog/messing with the XML-file encoding?

Upvotes: 0

Views: 567

Answers (2)

Martin J.H.

Reputation: 2205

Combining Andrej's answer with the answers given in the duplicate question, I can see that opening the file in binary mode ('rb') solves my problem:

from bs4 import BeautifulSoup
from bs4.diagnose import diagnose
with open('iso-8859-15_example.xml', 'rb') as f:
    diagnose(f)

This leads to the output

Diagnostic running on Beautiful Soup 4.7.1
Python version 3.6.7 (v3.6.7:6ec5cf24b7, Oct 20 2018, 13:35:33) [MSC v.1900 64 bit (AMD64)]
I noticed that html5lib is not installed. Installing it may help.
Found lxml version 4.3.4.0
Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<?xml version="1.0" encoding="ISO-8859-15"?>
<sometag>
</sometag>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<?xml version="1.0" encoding="ISO-8859-15"?>
<html>
 <body>
  <sometag>
  </sometag>
 </body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<someTag>
</someTag>
--------------------------------------------------------------------------------

and shows that lxml in XML mode (lxml-xml) now parses the document correctly.
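
For completeness, here is a minimal sketch of the actual parsing step with the same binary-mode approach. It assumes bs4 and lxml are installed. The euro sign in the sample document and the temporary file are my own additions for illustration (the original file has an empty tag); the euro sign is a single byte, 0xA4, in ISO-8859-15, so it shows whether the declared encoding was honored.

```python
import os
import tempfile

from bs4 import BeautifulSoup  # third-party; the 'xml' feature also needs lxml

# Hypothetical sample: the document from the question plus a euro sign,
# which is encoded as the single byte 0xA4 in ISO-8859-15.
xml_text = '<?xml version="1.0" encoding="ISO-8859-15"?>\n<someTag>\u20ac</someTag>\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'iso-8859-15_example.xml')
    with open(path, 'wb') as f:
        f.write(xml_text.encode('iso-8859-15'))

    # Open in binary mode so the parser receives raw bytes and can apply
    # the encoding declared in the prolog itself.
    with open(path, 'rb') as f:
        soup = BeautifulSoup(f, 'xml')

tag = soup.find('someTag')
print(tag)
```

No encoding argument is passed to open() or to BeautifulSoup here; the prolog alone tells the parser how to decode the bytes.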

Upvotes: 2

Andrej Kesely

Reputation: 195448

In this case I would recommend running BeautifulSoup's diagnose() function:

from bs4 import BeautifulSoup

from bs4.diagnose import diagnose

with open('iso-8859-15_example.xml', encoding="iso-8859-15") as f:
    diagnose(f.read())

On my machine this prints:

Diagnostic running on Beautiful Soup 4.7.1
Python version 3.6.8 (default, Jan 14 2019, 11:02:34) 
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
Found lxml version 4.3.3.0
Found html5lib version 1.0.1

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<?xml version="1.0" encoding="ISO-8859-15"?>
<sometag>
</sometag>
--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<!--?xml version="1.0" encoding="ISO-8859-15"?-->
<html>
 <head>
 </head>
 <body>
  <sometag>
  </sometag>
 </body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<?xml version="1.0" encoding="ISO-8859-15"?>
<html>
 <body>
  <sometag>
  </sometag>
 </body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>

--------------------------------------------------------------------------------

In this case, I would choose html.parser, as it will do the right thing.

So when you do:

with open('iso-8859-15_example.xml', encoding="iso-8859-15") as f:
    soup = BeautifulSoup(f.read(), 'html.parser')
print(soup)

It prints:

<?xml version="1.0" encoding="ISO-8859-15"?>
<sometag>
</sometag>
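
One thing to keep in mind, visible in the output above: html.parser lowercases tag names, so someTag becomes sometag. A small sketch of what that means for lookups, assuming only that bs4 is installed (the variable names are just for illustration):

```python
from bs4 import BeautifulSoup

markup = '<?xml version="1.0" encoding="ISO-8859-15"?>\n<someTag>\n</someTag>'
soup = BeautifulSoup(markup, 'html.parser')

# html.parser treats the input as HTML and normalizes tag names to lowercase,
# so a lookup has to use the lowercase name.
lowercase_hit = soup.find('sometag')
mixed_case_hit = soup.find('someTag')
print(lowercase_hit is not None, mixed_case_hit is None)
```

If the original casing of the tags matters, this is a reason to prefer an XML parser over html.parser.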

Upvotes: 0
