Reputation: 2205
I have the following XML document, saved with Notepad++ in ISO-8859-15 encoding:
<?xml version="1.0" encoding="ISO-8859-15"?>
<someTag>
</someTag>
I am trying to parse this file with bs4, but even when I specify the encoding everywhere I can think of, I get an empty result:
from bs4 import BeautifulSoup

filepath = 'iso-8859-15_example.xml'
with open(filepath, encoding="iso-8859-15") as f:
    soup = BeautifulSoup(f, 'xml', from_encoding="iso-8859-15")
print(soup)
# --> prints only '<?xml version="1.0" encoding="utf-8"?>', otherwise empty
Removing the encoding hints in the Python code does not help. Strangely, what does work is deleting the first line of the XML file, the <?xml ... ?> declaration (called the "prolog", I think).
What am I doing wrong here? I thought the prolog would help bs4 to "do the right thing" and choose the correct encoding. Is there an alternative to deleting the prolog/messing with the XML-file encoding?
Upvotes: 0
Views: 567
Reputation: 2205
Combining Andrej's answer and the answers given in the duplicate question, I can see that opening the file in binary mode ('rb') in the open() call solves my problem:
from bs4 import BeautifulSoup
from bs4.diagnose import diagnose
with open('iso-8859-15_example.xml', 'rb') as f:
    diagnose(f)
This leads to the output:
Diagnostic running on Beautiful Soup 4.7.1
Python version 3.6.7 (v3.6.7:6ec5cf24b7, Oct 20 2018, 13:35:33) [MSC v.1900 64 bit (AMD64)]
I noticed that html5lib is not installed. Installing it may help.
Found lxml version 4.3.4.0
Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<?xml version="1.0" encoding="ISO-8859-15"?>
<sometag>
</sometag>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<?xml version="1.0" encoding="ISO-8859-15"?>
<html>
<body>
<sometag>
</sometag>
</body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<someTag>
</someTag>
--------------------------------------------------------------------------------
This shows that lxml in XML mode (lxml-xml) now parses the document correctly.
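The underlying idea, handing the parser raw bytes so that it can read the encoding from the XML declaration itself, can also be checked without bs4. Below is a minimal sketch that swaps in the standard library's xml.etree.ElementTree for bs4/lxml, so it is not the code used above. The sample document and its "5 €" content are made up for illustration; the euro sign is used because it exists in ISO-8859-15 but not in ISO-8859-1.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample document; the euro sign is byte 0xA4 in ISO-8859-15.
xml_text = (
    '<?xml version="1.0" encoding="ISO-8859-15"?>\n'
    '<someTag>5 €</someTag>\n'
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'iso-8859-15_example.xml')

    # Write the file as ISO-8859-15 bytes, as Notepad++ would save it.
    with open(path, 'wb') as f:
        f.write(xml_text.encode('iso-8859-15'))

    # Binary mode: the parser receives raw bytes and decodes them itself,
    # using the encoding named in the XML declaration.
    with open(path, 'rb') as f:
        root = ET.parse(f).getroot()

print(root.tag, repr(root.text))
```

If the text comes back with the euro sign intact, the parser has honoured the declared encoding, and no encoding argument was needed in open().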
Upvotes: 2
Reputation: 195448
In this case I would recommend running BeautifulSoup's diagnose() function:
from bs4 import BeautifulSoup
from bs4.diagnose import diagnose
with open('iso-8859-15_example.xml', encoding="iso-8859-15") as f:
    diagnose(f.read())
On my machine this prints:
Diagnostic running on Beautiful Soup 4.7.1
Python version 3.6.8 (default, Jan 14 2019, 11:02:34)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
Found lxml version 4.3.3.0
Found html5lib version 1.0.1
Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<?xml version="1.0" encoding="ISO-8859-15"?>
<sometag>
</sometag>
--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<!--?xml version="1.0" encoding="ISO-8859-15"?-->
<html>
<head>
</head>
<body>
<sometag>
</sometag>
</body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<?xml version="1.0" encoding="ISO-8859-15"?>
<html>
<body>
<sometag>
</sometag>
</body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
--------------------------------------------------------------------------------
In this case, I would choose html.parser, as it keeps both the declaration and the element (although, as the output shows, it lowercases the tag name).
So when you do:
with open('iso-8859-15_example.xml', encoding="iso-8859-15") as f:
    soup = BeautifulSoup(f.read(), 'html.parser')
print(soup)
It prints:
<?xml version="1.0" encoding="ISO-8859-15"?>
<sometag>
</sometag>
Upvotes: 0