Reputation: 2298
I'm trying to parse an HTML file using BeautifulSoup with Python 3, but I get a UTF-8 decode error. I've tried decoding the file contents as UTF-8, but the error still appears.
How do I fix this?
This is what I have so far:
from bs4 import BeautifulSoup
with open("file.html") as fp:
    unicode_html = fp.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(unicode_html)
Traceback (most recent call last):
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 30287: invalid continuation byte
Upvotes: 0
Views: 4585
Reputation: 5372
The default mode for open() is rt, which reads in text mode. Use rb to read in binary mode. At the moment, decode() is being called on text that open() has already decoded, which it cannot handle.
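To make the difference between the two modes concrete, here is a minimal sketch. The temporary file and its contents are hypothetical stand-ins, not the file from the question.

```python
import os
import tempfile

# Hypothetical sample: a tiny HTML file that is valid UTF-8.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as f:
    f.write("<p>hello</p>".encode("utf-8"))

# Text mode (the default "rt"): open() decodes the bytes and read() returns str.
with open(path, encoding="utf-8") as fp:
    text = fp.read()

# Binary mode ("rb"): read() returns the raw bytes, which can be decoded later.
with open(path, "rb") as fp:
    raw = fp.read()

print(type(text).__name__)  # str
print(type(raw).__name__)   # bytes
print(hasattr(text, "decode"), hasattr(raw, "decode"))  # False True

os.remove(path)
```

Only the bytes object has a decode() method, so calling .decode() on the result of a text-mode read cannot work in Python 3.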
The UnicodeDecodeError in the traceback is raised inside codecs.py, which suggests the text-mode read itself is failing: the file contains a byte (0xf3) that is not valid UTF-8 at that position. When the text-mode read succeeds, as it did for me at a command prompt, the error is instead
AttributeError: 'str' object has no attribute 'decode'
which is the more direct symptom: in Python 3, read() in text mode returns a str, and str has no decode() method. I was also using a shebang of
#!/usr/bin/env python3 -X utf8
which enables Python's UTF-8 mode.
Change the line:
with open("file.html") as fp:
to
with open("file.html", "rb") as fp:
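Put together, the fix might look like the sketch below. The helper name read_html and the sample file are assumptions made for illustration; the sample deliberately contains a 0xf3 byte like the one in the traceback.

```python
import os
import tempfile


def read_html(path, encoding="utf-8"):
    """Read a file as raw bytes, then decode it, dropping undecodable bytes."""
    with open(path, "rb") as fp:
        return fp.read().decode(encoding, "ignore")


# Hypothetical sample: 0xf3 followed by an ASCII letter is not valid UTF-8.
fd, sample_path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as f:
    f.write(b"<p>coraz\xf3n</p>")

unicode_html = read_html(sample_path)
print(unicode_html)  # <p>corazn</p>

os.remove(sample_path)
```

The resulting string can then be passed on as in the question, for example BeautifulSoup(unicode_html, "html.parser"). Note that 'ignore' silently drops the offending byte; if the file is really in another encoding such as Latin-1, decoding with that encoding would keep the character instead.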
Upvotes: 3