UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 261060: character maps to <undefined>

I'm trying to extract the href values (email addresses) from HTML files provided by a client of my company. They sent me six months' worth of data, but I'm unable to extract the emails from two particular files. I keep getting the same UnicodeDecodeError every time, no matter what I try. As far as I can tell, these files are encoded in UTF-8. My code is below:

from bs4 import BeautifulSoup as bsoup

url = r"C:\Users\Maximiliano\Documents\enero.html"
soup = bsoup((open(url).read()))

data = []
for p in soup.find_all("a"):
    datos = p.get("href")
    # Skip anchors without an href; keep only values starting with "m" (the mailto: links)
    if datos and datos[0] == "m":
        data.append(datos)
print(data)

I've already tried adding .decode("utf-8") after the read(), but it doesn't make any difference. Please help me!

file: https://gofile.io/?c=SFM1T3

Upvotes: 0

Views: 3077

Answers (1)

Riccardo Bucco

Reputation: 15364

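As suggested in the comments, you simply have to pass the encoding parameter to open(). Without it, Python falls back to the platform's default encoding (on Windows typically cp1252, the 'charmap' codec named in the error message) instead of UTF-8: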

soup = bsoup((open(url, encoding="utf-8").read()))
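To see why the explicit encoding matters, here is a minimal sketch that uses only the standard library. It assumes the default encoding on the asker's machine was cp1252, which has no mapping for byte 0x81; in UTF-8 that same byte is simply the second half of a two-byte character such as "Á". The helper name read_html_text and the sample markup are hypothetical, made up for illustration, and are not taken from the client's files.

```python
import os
import tempfile


def read_html_text(path, encoding="utf-8"):
    """Read a file as text with an explicit encoding (hypothetical helper)."""
    # The with-block closes the file even if decoding raises.
    with open(path, encoding=encoding) as handle:
        return handle.read()


# Illustrative markup only: "Á" is stored in UTF-8 as the bytes C3 81.
sample = '<a href="mailto:ana@example.com">Ángela</a>'

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.html")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(sample)

    # Decoding as cp1252 hits the unmapped byte 0x81, as in the question.
    try:
        read_html_text(path, encoding="cp1252")
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    # Decoding as UTF-8 recovers the original text.
    text = read_html_text(path, encoding="utf-8")

print(cp1252_failed, text == sample)
```

The string returned by the helper can then be handed to bsoup(...) in the same way as in the question.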

Upvotes: 2
