postoronnim

Reputation: 556

BeautifulSoup cannot read file despite correct charset

I am trying to parse a file that declares UTF-8 in a meta tag, passing from_encoding="utf-8" to BeautifulSoup, yet I get a decoding error:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open(filename), "html.parser", from_encoding="utf-8")

File header:

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   Logs
  </title>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>

Error:

$ python3.6 dom.py
Traceback (most recent call last):
  File "dom.py", line 56, in <module>
    soup = BeautifulSoup(open(filename), "html.parser", from_encoding="utf-8")
  File "/usr/local/lib/python3.6/site-packages/bs4/__init__.py", line 309, in __init__
    markup = markup.read()
  File "/usr/local/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 152902: ordinal not in range(128)

How should I proceed in debugging this? Thanks

Upvotes: 1

Views: 686

Answers (1)

iamzeid

Reputation: 116

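You didn't open the file correctly. open(filename) without an encoding argument decodes the file with the platform's default encoding, which in your environment is ASCII (note the encodings/ascii.py frame in the traceback). The error is raised inside markup.read(), before BeautifulSoup ever uses from_encoding. Pass encoding="utf-8" to open() so the file itself is decoded as UTF-8. from_encoding only applies to byte input, so it can be dropped once open() does the decoding.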
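To see that the failure comes from the decoding step rather than from BeautifulSoup, here is a minimal sketch that needs no file and no bs4. The byte string raw and the helper try_decode are made up for illustration; raw just contains the same 0xc2 lead byte that the traceback complains about.

```python
# Hypothetical sample data: "Logs" followed by U+00A0 (non-breaking space),
# which UTF-8 encodes as the two bytes 0xc2 0xa0.
raw = b"Logs\xc2\xa0"


def try_decode(data, encoding):
    """Return the decoded text, or a message describing why decoding failed."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"failed: {exc}"


# ASCII only covers bytes 0-127, so 0xc2 cannot be decoded.
ascii_result = try_decode(raw, "ascii")

# UTF-8 treats 0xc2 0xa0 as one character and succeeds.
utf8_result = try_decode(raw, "utf-8")

print(ascii_result)
print(repr(utf8_result))
```

The ASCII attempt fails with the same kind of message as in the question, while the UTF-8 attempt succeeds, which is why the encoding has to be given to open() as in the snippets that follow.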

from bs4 import BeautifulSoup

with open(filename, "r", encoding="utf-8") as f:
    # f yields already-decoded text, so from_encoding is not needed
    soup = BeautifulSoup(f, "html.parser")

Or read the contents into a string first:

from bs4 import BeautifulSoup

# Read the decoded text, then parse it; the with block closes the file
with open(filename, "r", encoding="utf-8") as f:
    markup = f.read()
soup = BeautifulSoup(markup, "html.parser")
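As for how to debug this kind of error in general, it can help to look at the raw bytes around the reported position and to check whether the file really is valid UTF-8. The following is only a sketch: first_invalid_utf8 and show_context are hypothetical helper names, and good and bad are stand-in byte strings rather than your file.

```python
def first_invalid_utf8(data):
    """Return (offset, nearby_bytes) for the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        start = max(exc.start - 10, 0)
        return exc.start, data[start:exc.end + 10]
    return None


def show_context(data, offset, width=10):
    """Return the raw bytes surrounding a byte offset, e.g. one from a traceback."""
    return data[max(offset - width, 0):offset + width]


# Stand-in data: a degree sign encoded as UTF-8 (valid) and as Latin-1 (invalid UTF-8).
good = "temp: 21\u00b0C".encode("utf-8")
bad = b"temp: 21\xb0C"

print(first_invalid_utf8(good))        # valid UTF-8, so nothing to report
print(first_invalid_utf8(bad))         # offset and bytes around the bad sequence
print(show_context(good, 8, width=4))  # bytes around the 0xc2 lead byte
```

For the real file you would read the bytes with open(filename, "rb") and pass them to these helpers. The position 152902 from the traceback should correspond to a byte offset in the file, assuming nothing had been read from it before the failing read() call. If first_invalid_utf8 returns None, the file is valid UTF-8 and only the open() call needs fixing.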

Upvotes: 1
