Reputation: 665
I am trying to read a large set of .htm files with Python. To do so, I am using the following:
HtmlFile = codecs.open(file, 'r')
text = BeautifulSoup(HtmlFile.read()).text
However, this results in the following error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 411:
character maps to <undefined>
So I tried decoding the file as UTF-8, like so:
HtmlFile = codecs.open(file, 'r', encoding='utf-8')
text = BeautifulSoup(HtmlFile.read()).text
And then I got this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 4565:
invalid start byte
I tried following the advice here, but it did not help. Any help would be greatly appreciated!
Upvotes: 2
Views: 1687
Reputation: 1531
I've done a bit of research, and this looks like an issue with a Microsoft-generated file that uses the CP1252 (windows-1252) encoding but contains some bytes that encoding leaves undefined, such as the 0x81 in your first error. Given the following:
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=Generator content="Microsoft Word 15 (filtered)">
in your HTML file, this seems more than likely.
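To illustrate why both attempts in the question fail, here is a minimal sketch using only the standard library. The byte values come from the two error messages; the helper name can_decode is my own and not part of any library.

```python
# Single bytes taken from the two error messages in the question.
NBSP = b"\xa0"       # non-breaking space in CP1252 and Latin-1
UNDEFINED = b"\x81"  # unassigned in Python's cp1252 codec


def can_decode(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes under `encoding` without an error."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# 0xa0 is valid CP1252 but cannot start a UTF-8 sequence.
nbsp_cp1252 = can_decode(NBSP, "cp1252")
nbsp_utf8 = can_decode(NBSP, "utf-8")

# 0x81 has no mapping in cp1252, while Latin-1 maps every byte value.
undefined_cp1252 = can_decode(UNDEFINED, "cp1252")
undefined_latin1 = can_decode(UNDEFINED, "latin-1")

print(nbsp_cp1252, nbsp_utf8, undefined_cp1252, undefined_latin1)
```

So the file is probably neither valid UTF-8 nor strictly valid CP1252, which matches the two tracebacks.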
According to this answer, decoding the file as Latin-1 may help in that case:
import codecs
from bs4 import BeautifulSoup
HtmlFile = codecs.open(file, 'r', encoding='latin-1')
text = BeautifulSoup(HtmlFile.read()).text
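Let me know if this works. Beware, though, that Latin-1 never raises a decoding error because it maps every byte to a character, but it does not have all the characters that the Microsoft encodings have: bytes 0x80 to 0x9F, which CP1252 uses for characters such as curly quotes, come out as invisible control characters instead.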
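If keeping the CP1252 characters matters, one alternative (my own suggestion, not from the linked answer) is to read the raw bytes and decode them as CP1252 with errors="replace", so that only the undefined bytes are substituted. The sketch below assumes the files really are CP1252; read_cp1252 and the sample bytes are hypothetical names and data for illustration.

```python
import os
import tempfile


def read_cp1252(path: str) -> str:
    """Decode a file as CP1252, replacing undefined bytes with U+FFFD."""
    with open(path, "rb") as fh:
        raw = fh.read()
    return raw.decode("cp1252", errors="replace")


# Hypothetical sample: curly quotes (0x93, 0x94), a non-breaking space
# (0xa0), and one byte that CP1252 leaves undefined (0x81).
sample = b"<p>\x93quoted\x94\xa0text\x81</p>"

with tempfile.NamedTemporaryFile(suffix=".htm", delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

decoded = read_cp1252(sample_path)    # curly quotes survive, 0x81 becomes U+FFFD
as_latin1 = sample.decode("latin-1")  # no error, but quotes become control chars
os.remove(sample_path)

print(repr(decoded))
print(repr(as_latin1))
```

The resulting string can then be passed to BeautifulSoup in place of HtmlFile.read().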
Upvotes: 4