Cannot parse ISO-8859-15 encoded XML with bs4

Question

I have the following XML document, saved with Notepad++ in ISO-8859-15 encoding:

I try to parse this file using bs4, but somehow (even when specifying the encoding everywhere I can think of), I get an empty result:

from bs4 import BeautifulSoup

filepath = 'iso-8859-15_example.xml'
with open(filepath, encoding="iso-8859-15") as f:
    soup = BeautifulSoup(f, 'xml', from_encoding="iso-8859-15")
print(soup)
# --> "", otherwise empty

Removing the encoding hints in the Python code does not help. Strangely, what does work is deleting the first line of the XML file, which is the XML declaration (called the "prolog", I think).

What am I doing wrong here? I thought the prolog would help bs4 to "do the right thing" and choose the correct encoding. Is there an alternative to deleting the prolog/messing with the XML-file encoding?

Martin J.H. · Accepted Answer

Combining Andrej's answer and the answers given in the duplicate question, I can see that opening the file in binary mode ('rb') in the open call solves my problem. Running bs4's diagnose on the file opened this way:

from bs4 import BeautifulSoup
from bs4.diagnose import diagnose
with open('iso-8859-15_example.xml', 'rb') as f:
    diagnose(f)

This leads to the output:

Diagnostic running on Beautiful Soup 4.7.1
Python version 3.6.7 (v3.6.7:6ec5cf24b7, Oct 20 2018, 13:35:33) [MSC v.1900 64 bit (AMD64)]
I noticed that html5lib is not installed. Installing it may help.
Found lxml version 4.3.4.0
Trying to parse your markup with html.parser
Here's what html.parser did with the markup:



--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:


 
  
  
 

--------------------------------------------------------------------------------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:



--------------------------------------------------------------------------------

and shows that lxml in xml mode works well.
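
For the actual parse, the same binary-mode file object can be handed straight to BeautifulSoup. Here is a minimal sketch, assuming lxml is installed. The sample document and the helper name parse_xml_file are made up for illustration, since the original file is not reproduced here:

```python
import os
import tempfile

from bs4 import BeautifulSoup  # the "xml" parser additionally requires lxml

# Hypothetical sample document (assumption: the original file is not shown).
SAMPLE = (
    '<?xml version="1.0" encoding="ISO-8859-15"?>\n'
    '<root><price>10 €</price></root>\n'
)


def parse_xml_file(path):
    """Parse an XML file, leaving encoding detection to the parser."""
    # 'rb' passes undecoded bytes to Beautiful Soup, so the encoding
    # declared in the prolog can be used instead of clashing with a str.
    with open(path, "rb") as f:
        return BeautifulSoup(f, "xml")


# Write the sample as ISO-8859-15 bytes (the euro sign becomes byte 0xA4).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "iso-8859-15_example.xml")
with open(sample_path, "wb") as f:
    f.write(SAMPLE.encode("iso-8859-15"))

soup = parse_xml_file(sample_path)
price_text = soup.find("price").get_text()
print(price_text)
```

In text mode, Python has already decoded the file to str before bs4 sees it, and bs4 ignores from_encoding for str input, so the encoding hints in the question could not have any effect. In binary mode no hint should be needed as long as the prolog names the correct encoding.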
