Reputation: 179
My XML file declares its encoding as follows:
<?xml version="1.0" encoding="utf-8"?>
I am trying to parse this file using Beautiful Soup.
from bs4 import BeautifulSoup
fd = open("xmlsample.xml")
soup = BeautifulSoup(fd,'lxml-xml',from_encoding='utf-8')
But this results in
Traceback (most recent call last):
  File "C:\Users\gregg_000\Desktop\Python Experiments\NRE_XMLtoCSV\NRE_XMLtoCSV\bs1.py", line 4, in <module>
    soup = BeautifulSoup(fd,'lxml-xml', from_encoding='utf-8')
  File "C:\Users\gregg_000\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py", line 245, in __init__
    markup = markup.read()
  File "C:\Users\gregg_000\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5343910: character maps to <undefined>
My sense is that Python is falling back to the default cp1252 character set when it reads the file. How can I force utf-8 without having to resort to the command line? (I'm in an environment where I can't easily make global changes to the Python setup.)
Upvotes: 6
Views: 1002
Reputation: 12581
You should also pass the encoding to your open() call (it is an accepted argument, as the docs indicate). In text mode, open() decodes with the platform's default encoding, and on Windows (at least in my install) that default is, as you guessed, cp1252.
from bs4 import BeautifulSoup
fd = open("xmlsample.xml", encoding='utf-8')
soup = BeautifulSoup(fd,'lxml-xml',from_encoding='utf-8')
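To see why the explicit encoding matters, here is a minimal standard-library sketch. The sample XML content and the temporary file are made up for illustration; the idea is only that a UTF-8 file containing the byte 0x9d (for example, as part of a right double quotation mark) cannot be read through the cp1252 codec, while the same file reads fine once the encoding is given to open().

```python
import os
import tempfile

# Hypothetical sample document: U+201D (right double quote) is encoded as
# E2 80 9D in UTF-8, and byte 0x9D has no mapping in cp1252.
xml_text = '<?xml version="1.0" encoding="utf-8"?>\n<note>\u201cquoted\u201d</note>\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "xmlsample.xml")
    with open(path, "wb") as out:
        out.write(xml_text.encode("utf-8"))

    # Forcing cp1252 here mimics the Windows default described above.
    try:
        with open(path, encoding="cp1252") as fd:
            fd.read()
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    # An explicit encoding makes the read independent of the platform default.
    with open(path, encoding="utf-8") as fd:
        decoded = fd.read()

    # Binary mode skips decoding entirely and hands back raw bytes.
    with open(path, "rb") as fd:
        raw = fd.read()

print(cp1252_failed)
print(decoded == xml_text)
```

As a possible alternative, you could open the file in binary mode ("rb") and pass that file object to BeautifulSoup. In that case the parser receives bytes, so from_encoding='utf-8' is what decides how they are decoded, rather than the platform default used by open().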
Upvotes: 1