Stephen Strosko

Reputation: 665

Encoding Issues when reading .htm files with Python

I am attempting to read in a large set of .htm files with Python. To do so I am using the following:

import codecs
from bs4 import BeautifulSoup
HtmlFile = codecs.open(file, 'r')
text = BeautifulSoup(HtmlFile.read()).text

However, this results in the following error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 411: 
character maps to <undefined>

So I tried specifying UTF-8 as the encoding, like so:

HtmlFile = codecs.open(file, 'r', encoding='utf-8')
text = BeautifulSoup(HtmlFile.read()).text

And then I got this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 4565: 
invalid start byte

I tried following the advice here, but it did not help. Any help would be greatly appreciated!

Upvotes: 2

Views: 1687

Answers (1)

AER

Reputation: 1531

I've done a bit of research, and this looks like an issue with a Microsoft-generated file that uses the CP1252 (windows-1252) encoding but contains some bytes that are not decoded correctly. If your HTML file contains the following:

<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=Generator content="Microsoft Word 15 (filtered)">

then this is more than likely the cause.
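
To check which encoding a file declares before decoding it, the charset can be read from the raw bytes of the meta tag. This is only a minimal sketch using the standard library; `declared_charset` and `scan_bytes` are names made up for illustration, and it assumes the declaration appears near the top of the file:

```python
import codecs
import re

# Matches charset=... inside a <meta> tag, with or without quotes.
CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)', re.IGNORECASE)

def declared_charset(raw, scan_bytes=4096):
    """Return the Python codec name declared in the HTML head, or None.

    Hypothetical helper: only the first `scan_bytes` bytes are searched.
    """
    match = CHARSET_RE.search(raw[:scan_bytes])
    if match is None:
        return None
    label = match.group(1).decode("ascii")
    try:
        # Normalise aliases such as "windows-1252" to Python's codec name.
        return codecs.lookup(label).name
    except LookupError:
        return None  # the declared label is not a codec Python knows

sample = b'<meta http-equiv=Content-Type content="text/html; charset=windows-1252">'
print(declared_charset(sample))
```

Python treats `windows-1252` as an alias for its `cp1252` codec, so a declared charset can be passed straight to `decode` once it has been looked up. The declaration can still be wrong, which is why decoding may fail even when it is present.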

According to this answer, decoding with Latin-1 could help in that case:

# Latin-1 maps every byte to a character, so decoding cannot fail
with codecs.open(file, 'r', encoding='latin-1') as HtmlFile:
    text = BeautifulSoup(HtmlFile.read(), 'html.parser').text

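Let me know if this works. Beware, though, that Latin-1 does not have all the characters that the Microsoft encodings have: it decodes bytes 0x80 to 0x9F as control characters, whereas CP1252 uses that range for curly quotes, dashes, and the euro sign.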
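
If those Microsoft-specific characters matter, one possible alternative is to read the file as bytes and try strict decoding first, falling back to CP1252 with replacement only when that fails. This is a sketch rather than a definitive fix; `decode_html_bytes` and the sample bytes are hypothetical, and the order of encodings is an assumption:

```python
def decode_html_bytes(raw, encodings=("utf-8", "cp1252")):
    """Decode raw .htm bytes, trying each encoding strictly before falling back."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue  # some byte is invalid in this encoding; try the next one
    # Last resort: keep CP1252's mapping and turn undefined bytes into U+FFFD.
    return raw.decode("cp1252", errors="replace")

# Hypothetical bytes: Word-style curly quotes, a non-breaking space, and 0x81
sample = b"\x93quoted\x94\xa0text\x81"
print(repr(decode_html_bytes(sample)))   # curly quotes kept, 0x81 replaced
print(repr(sample.decode("latin-1")))    # same bytes decoded as Latin-1
```

The bytes would come from something like `open(file, 'rb').read()`, and the returned string can then be passed to `BeautifulSoup(text, 'html.parser')` as before. The trade-off is that any byte CP1252 leaves undefined shows up as the replacement character instead of raising an error.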

Upvotes: 4
