UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 261060: character maps to <undefined>

Question

I'm trying to extract the href values (email addresses) from HTML files provided by a client of my company. They sent me six months' worth of data, but I'm unable to extract the emails from two particular files. I keep getting the same UnicodeDecodeError every time, no matter what I try. As far as I can tell, these files are encoded in UTF-8. My code is below:

from bs4 import BeautifulSoup as bsoup

url = r"C:\Users\Maximiliano\Documents\enero.html"
soup = bsoup((open(url).read()))

data = [] 
for p in soup.find_all("a"):
    datos = p.get("href")
    if datos[0] != "m":
        pass
    else:
        data.append(datos)
print(data)

I've already tried adding .decode("utf-8") after the read(), but it makes no difference. Please help me!

file: https://gofile.io/?c=SFM1T3

Riccardo Bucco · Accepted Answer

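As suggested in the comments, you simply have to add the encoding parameter. Without it, open() decodes the file with the platform's default encoding, which on Windows is typically cp1252 (the 'charmap' codec named in the error), and byte 0x81 has no character assigned in that code page: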

soup = bsoup((open(url, encoding="utf-8").read()))
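To see why the explicit encoding matters, here is a minimal sketch that reproduces the error with the standard library only. The sample string and the temporary file are made up for illustration, and cp1252 is passed explicitly to stand in for the Windows default, so the behaviour is the same on any platform. It assumes the file really is UTF-8, as the question states.

```python
import os
import tempfile

# Hypothetical sample: "Á" (U+00C1) is encoded in UTF-8 as the bytes 0xC3 0x81,
# so the file contains the same 0x81 byte that appears in the error message.
sample = '<a href="mailto:Ángel@example.com">Ángel</a>'

# Write the sample as UTF-8 bytes to a temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    path = tmp.name

try:
    # Decoding as cp1252 (what open() typically defaults to on Windows) fails,
    # because 0x81 is not mapped to any character in that code page.
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
        failed = False
        message = ""
    except UnicodeDecodeError as exc:
        failed = True
        message = str(exc)

    # Decoding with the file's real encoding succeeds.
    with open(path, encoding="utf-8") as f:
        text = f.read()
finally:
    os.remove(path)

print("cp1252 failed:", failed)
print(message)
print(text)
```

If a file turns out not to be valid UTF-8 after all, the same call will raise a UnicodeDecodeError that names the 'utf-8' codec instead, which is a sign that the encoding guess needs revisiting.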
