Reputation: 1
I'm trying to extract the href values (email addresses) from HTML files provided by a client of my company. They sent me six months' worth of data, but I can't extract the emails from two particular files: I get the same UnicodeDecodeError every time, no matter what I try. According to my analysis, these files are encoded in UTF-8. Here is the code:
from bs4 import BeautifulSoup as bsoup
url = r"C:\Users\Maximiliano\Documents\enero.html"
soup = bsoup((open(url).read()))
data = []
for p in soup.find_all("a"):
    datos = p.get("href")
    if datos[0] != "m":
        pass
    else:
        data.append(datos)
print(data)
I've already tried adding .decode("utf-8") after the read, but it makes no difference. Please help me!
file: https://gofile.io/?c=SFM1T3
Upvotes: 0
Views: 3077
Reputation: 15364
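As suggested in the comments, you simply have to pass the encoding parameter to open(). Without it, open() decodes the file with the platform's default encoding (on Windows this is typically a legacy code page such as cp1252 rather than UTF-8), which can raise a UnicodeDecodeError on UTF-8 data: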
soup = bsoup((open(url, encoding="utf-8").read()))
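For a slightly more defensive version, here is a minimal sketch of the same idea. The helper names `read_html` and `keep_mailto` are hypothetical (they are not from the question), and filtering on the `mailto:` prefix is an assumption about what the question's first-character check on "m" was meant to catch. The sketch uses a `with` block so the file is closed, and it skips `<a>` tags without an href, for which `p.get("href")` returns None.

```python
import os
import tempfile


def read_html(path, encoding="utf-8"):
    """Read a file as text, decoding it with an explicit encoding."""
    with open(path, encoding=encoding) as fh:
        return fh.read()


def keep_mailto(hrefs):
    """Keep only mailto: links; None entries (tags without href) are skipped."""
    return [h for h in hrefs if h and h.startswith("mailto:")]


# Small demo: write a UTF-8 file containing a non-ASCII character, then read it back.
sample = '<a href="mailto:josé@example.com">José</a>'
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.html")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(sample)
    html = read_html(path)

# With BeautifulSoup, as in the question, the helpers would be used like this:
#   soup = bsoup(read_html(url), "html.parser")
#   data = keep_mailto(p.get("href") for p in soup.find_all("a"))
```

If the error persists even with encoding="utf-8", the file is probably not valid UTF-8 after all, and a different encoding value would need to be tried.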
Upvotes: 2