"illegal multibyte sequence" error from BeautifulSoup when Python 3

Question

I have an .html file saved to local disk, and I am using BeautifulSoup (bs4) to parse it.

It all worked fine until I recently switched to Python 3.

I tested the same .html file on another machine with Python 2; it works there and returns the page contents.

soup = BeautifulSoup(open('page.html'), "lxml")

On the machine with Python 3 it does not work, and it fails with:

UnicodeDecodeError: 'gbk' codec can't decode byte 0x92 in position 298670: illegal multibyte sequence

I searched around and tried the following, but none of them worked (using 'r' or 'rb' doesn't make much difference):

soup = BeautifulSoup(open('page.html', 'r'), "lxml")
soup = BeautifulSoup(open('page.html', 'r'), 'html.parser')
soup = BeautifulSoup(open('page.html', 'r'), 'html5lib')
soup = BeautifulSoup(open('page.html', 'r'), 'xml')

How can I use Python 3 to parse this html page?

Thank you.

GPhilo · Accepted Answer

"It all worked fine until I recently switched to Python 3."

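In Python 3, strings are Unicode text, so when you open a file in text mode Python decodes its bytes, using the platform's default encoding when none is given (gbk on your machine, judging by the error). Python 2, on the other hand, uses bytestrings and just returns the content of the file as-is. Try opening page.html in binary mode (open('page.html', 'rb')) so that BeautifulSoup receives the raw bytes and works out the encoding itself, and see if that works for you.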
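
As a rough illustration of the difference between the two modes, here is a minimal sketch. The sample bytes, the helper names (read_as_text, read_as_bytes, parse_html, demo) and the choice of 'html.parser' are my own assumptions for the example, not taken from your file; parse_html also assumes the beautifulsoup4 package is installed.

```python
import os
import tempfile

# Hypothetical stand-in for page.html: 0x92 is a curly apostrophe in
# windows-1252, but 0x92 followed by a space is not a valid gbk sequence.
SAMPLE = b"<html><body><p>the users\x92 guide</p></body></html>"


def read_as_text(path, encoding):
    """Text mode: Python 3 decodes while reading, so a wrong codec raises."""
    with open(path, "r", encoding=encoding) as fh:
        return fh.read()


def read_as_bytes(path):
    """Binary mode: nothing is decoded; the raw bytes come back untouched."""
    with open(path, "rb") as fh:
        return fh.read()


def parse_html(path, parser="html.parser"):
    """Hand BeautifulSoup the raw bytes and let it detect the encoding."""
    from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

    with open(path, "rb") as fh:
        return BeautifulSoup(fh, parser)


def demo():
    """Write the sample to a temporary file and try both read modes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "page.html")
        with open(path, "wb") as fh:
            fh.write(SAMPLE)
        try:
            read_as_text(path, "gbk")
            text_mode_failed = False
        except UnicodeDecodeError:
            text_mode_failed = True
        return text_mode_failed, read_as_bytes(path)
```

If you already know the file's real encoding, passing encoding= to open() or from_encoding= to BeautifulSoup should also work; which encoding your page actually uses is something I can only guess at from the 0x92 byte.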