Greg Williams

Reputation: 179

Handling an encoding error when parsing XML with Beautiful Soup

My XML file declares its encoding thus:

<?xml version="1.0" encoding="utf-8"?>

I am trying to parse this file using Beautiful Soup.

from bs4 import BeautifulSoup

fd = open("xmlsample.xml")  
soup = BeautifulSoup(fd,'lxml-xml',from_encoding='utf-8')

But this results in

Traceback (most recent call last):
  File "C:\Users\gregg_000\Desktop\Python Experiments\NRE_XMLtoCSV\NRE_XMLtoCSV\bs1.py", line 4, in <module>
    soup = BeautifulSoup(fd,'lxml-xml', from_encoding='utf-8')
  File "C:\Users\gregg_000\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py", line 245, in __init__
    markup = markup.read()
  File "C:\Users\gregg_000\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5343910: character maps to <undefined>

My sense is that Python wants to use the default cp1252 character set. How can I force utf-8 without resorting to the command line? (I'm in a setup where I can't easily make global changes to the Python configuration.)

Upvotes: 6

Views: 1002

Answers (1)

Jonah Bishop

Reputation: 12581

You should also pass the encoding to your open() call (encoding is an accepted keyword argument, as the docs indicate). On Windows (at least in my install), the default is, as you guessed, cp1252, so the file is decoded with the wrong codec before Beautiful Soup ever sees it.

from bs4 import BeautifulSoup

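# Pass the encoding to open() so the file is decoded as UTF-8 rather than cp1252,
# and use a with block so the file is closed once parsing is done.
with open("xmlsample.xml", encoding='utf-8') as fd:
    soup = BeautifulSoup(fd,'lxml-xml',from_encoding='utf-8')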
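
To see why the explicit encoding matters, here is a minimal sketch that uses only the standard library. The sample text and the temporary file name are made up for illustration; the assumption is that the original file contains a character such as a right double quotation mark (U+201D), whose UTF-8 encoding ends in the byte 0x9d, which cp1252 cannot map.

```python
import os
import tempfile

# Hypothetical sample document: U+201D encodes to the UTF-8 bytes E2 80 9D,
# and 0x9D has no mapping in cp1252.
xml_text = '<?xml version="1.0" encoding="utf-8"?>\n<note>\u201cquoted\u201d</note>\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")  # illustrative file name
    with open(path, "wb") as f:
        f.write(xml_text.encode("utf-8"))

    # Decoding as cp1252 (the Windows default described above) fails.
    cp1252_error = None
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
    except UnicodeDecodeError as exc:
        cp1252_error = exc

    # Decoding with an explicit utf-8 encoding recovers the original text.
    with open(path, encoding="utf-8") as f:
        decoded = f.read()

print(cp1252_error)
print(decoded == xml_text)
```

The failure happens inside the file's read() call, which matches the traceback in the question: Beautiful Soup only triggers the read. An alternative not covered in this answer would be to open the file in binary mode ("rb"), in which case no decoding happens in open() and from_encoding is what tells Beautiful Soup how to interpret the bytes.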

Upvotes: 1
