Encoding Issues when reading .htm files with Python

Question

I am attempting to read in a large set of .htm files with Python. To do so I am using the following:

import codecs
from bs4 import BeautifulSoup

HtmlFile = codecs.open(file, 'r')
text = BeautifulSoup(HtmlFile.read()).text

However, this results in the following error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 411: 
character maps to <undefined>

So, I tried encoding with utf-8 like so:

HtmlFile = codecs.open(file, 'r', encoding='utf-8')
text = BeautifulSoup(HtmlFile.read()).text

And then I got this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 4565: 
invalid start byte

I tried following the advice here, but it was not helping. Any help would be greatly appreciated!

AER · Accepted Answer

I've done a bit of research, and this looks like an issue with a Microsoft-generated file that uses the CP1252 encoding, although some bytes are still not picked up correctly. Given the following:

in your html file this seems more than likely.
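To see why both attempts in the question fail, here is a minimal sketch that decodes the two offending bytes on their own. It assumes the 'charmap' codec in the first traceback is CP1252 (the usual default on Western-European Windows installs); `try_decode` is a hypothetical helper name, not part of any library.

```python
def try_decode(raw: bytes, encoding: str) -> str:
    """Return the decoded text, or a short error description if decoding fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"error: {exc.reason}"

# 0x81 has no character assigned in Python's cp1252 codec, so decoding fails.
cp1252_result = try_decode(b"\x81", "cp1252")

# 0xa0 on its own is not a valid UTF-8 sequence, so decoding fails there too.
utf8_result = try_decode(b"\xa0", "utf-8")

# The same 0xa0 byte is a non-breaking space in cp1252.
nbsp_result = try_decode(b"\xa0", "cp1252")

print(cp1252_result)
print(utf8_result)
print(repr(nbsp_result))
```

So the file is not valid UTF-8, and it also contains at least one byte that strict CP1252 does not define.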

According to this answer, if you use Latin-1 encoding for that example it could help:

HtmlFile = codecs.open(file, 'r', encoding='latin-1')
text = BeautifulSoup(HtmlFile.read()).text

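Let me know if this works. Beware, though, that Latin-1 does not have all the characters that the Microsoft encodings have: bytes 0x80 to 0x9F, which CP1252 uses for characters such as curly quotes and the euro sign, decode to control characters under Latin-1.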
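If you want to keep the CP1252 characters where possible, one option is to read the file as bytes and try several encodings in turn. This is only a sketch under that assumption: `decode_html_bytes` and `read_html_text` are hypothetical names, and the order of encodings is a guess you may need to adjust for your files.

```python
from pathlib import Path

def decode_html_bytes(raw: bytes) -> str:
    """Try UTF-8, then CP1252, and fall back to Latin-1, which never fails."""
    for encoding in ("utf-8", "cp1252"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte to a code point, so this always succeeds,
    # but bytes 0x80 to 0x9F come out as control characters.
    return raw.decode("latin-1")

def read_html_text(path) -> str:
    """Read a file in binary mode and decode it with the fallback chain above."""
    return decode_html_bytes(Path(path).read_bytes())
```

The returned string can then be passed to BeautifulSoup in place of HtmlFile.read(). Files that decode cleanly as CP1252 keep their curly quotes, and only files containing an undefined byte such as 0x81 drop down to Latin-1.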
