How to open HTML file as UTF-8 for parsing it?

Question

I'm trying to parse an HTML file using BeautifulSoup with Python 3, but I get a UTF-8 decode error. I've tried decoding the file contents as UTF-8 after reading them, but the error still appears.

How can I fix this?

This is what I have so far.

from bs4 import BeautifulSoup

with open("file.html") as fp:
    unicode_html = fp.read().decode('utf-8', 'ignore')

soup = BeautifulSoup(unicode_html)

Traceback (most recent call last):
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 30287: invalid continuation byte

michael_heath · Accepted Answer

The default mode for open() is rt, which reads in text mode, so fp.read() already returns a decoded str. Use rb to read in binary mode, so that fp.read() returns bytes and .decode('utf-8', 'ignore') has something to decode.

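The UnicodeDecodeError is raised by fp.read() itself: in text mode the file is decoded with the default encoding (UTF-8, according to your traceback) using strict error handling, and byte 0xf3 at position 30287 is not valid UTF-8 at that point, so the script fails before .decode() is ever called.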

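Running the script from a command prompt, the error I get is

AttributeError: 'str' object has no attribute 'decode'

which is the error I would expect from this code, because a str has no decode() method in Python 3. I was also using a shebang of

#!/usr/bin/env python3 -X utf8

which enables Python's UTF-8 mode, making UTF-8 the default encoding for open() and the standard streams.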
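Both errors can be reproduced without any file at all. The following is only a minimal sketch: the sample bytes are a hypothetical stand-in for the real file, assuming it contains a Latin-1 encoded "ó" (byte 0xf3) as in the traceback.

```python
# Hypothetical sample: "Educación" encoded as Latin-1, so "ó" is the single byte 0xf3.
raw = b"Educaci\xf3n"

# 1. A str (what read() returns in text mode) has no decode() method in Python 3,
#    which is where the AttributeError comes from.
str_has_decode = hasattr("already decoded text", "decode")

# 2. Strict UTF-8 decoding rejects the lone 0xf3 byte with a UnicodeDecodeError.
try:
    raw.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc.reason

# 3. With errors="ignore" the offending byte is silently dropped.
lenient = raw.decode("utf-8", "ignore")

print(str_has_decode)   # False
print(strict_error)
print(lenient)          # Educacin
```

Note that "ignore" loses the character entirely; it does not recover the "ó".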

Change the line:

with open("file.html") as fp:

to

with open("file.html", "rb") as fp:
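Put together, the fix could look like the sketch below. The helper names and the demo file are hypothetical, and the second helper is an alternative I am adding for comparison (text mode with an explicit encoding and error handler), not something from the question.

```python
import tempfile
from pathlib import Path


def read_html_binary(path):
    """Read raw bytes, then decode as UTF-8, dropping bytes that are not valid UTF-8."""
    with open(path, "rb") as fp:  # binary mode: read() returns bytes
        return fp.read().decode("utf-8", "ignore")


def read_html_text(path):
    """Alternative: stay in text mode and let open() decode leniently."""
    with open(path, encoding="utf-8", errors="ignore") as fp:
        return fp.read()


# Hypothetical demo file containing one byte (0xf3) that is not valid UTF-8 here.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "file.html"
    demo.write_bytes(b"<p>Educaci\xf3n</p>")
    html_a = read_html_binary(demo)
    html_b = read_html_text(demo)

print(html_a)             # <p>Educacin</p>
print(html_a == html_b)   # True

# Either string can then be handed to the parser, for example:
# soup = BeautifulSoup(html_a, "html.parser")
```

Both variants drop the undecodable byte. If the file is actually in another encoding (Latin-1 would be my guess for a lone 0xf3, but that is an assumption), decoding with that encoding instead of ignoring errors would keep the characters.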
