Reputation: 556
I am trying to parse a file that declares UTF-8 in a meta tag with BeautifulSoup, passing from_encoding="utf-8", yet I get a decoding error:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open(filename), "html.parser", from_encoding="utf-8")
File header:
<!DOCTYPE html>
<html lang="en">
<head>
<title>
Logs
</title>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
Error:
$ python3.6 dom.py
Traceback (most recent call last):
  File "dom.py", line 56, in <module>
    soup = BeautifulSoup(open(filename), "html.parser", from_encoding="utf-8")
  File "/usr/local/lib/python3.6/site-packages/bs4/__init__.py", line 309, in __init__
    markup = markup.read()
  File "/usr/local/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 152902: ordinal not in range(128)
How should I proceed in debugging this? Thanks
Upvotes: 1
Views: 686
Reputation: 116
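You didn't open the file correctly. The error is raised by open(), not by BeautifulSoup: without an encoding argument, open() decodes the file with the locale's default encoding (ASCII on your system, per the traceback), so markup.read() fails before from_encoding is ever used. Pass the encoding to open() instead. BeautifulSoup then receives already-decoded text, and from_encoding would be ignored, so it can be dropped.
from bs4 import BeautifulSoup
with open(filename, "r", encoding="utf-8") as f:
    # f yields decoded str, so BeautifulSoup needs no from_encoding
    soup = BeautifulSoup(f, "html.parser")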
OR
from bs4 import BeautifulSoup
with open(filename, "r", encoding="utf-8") as f:
    # read the whole file into a str; the with block closes the file
    markup = f.read()
soup = BeautifulSoup(markup, "html.parser")
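To debug this kind of error, it can help to reproduce it without BeautifulSoup. The sketch below uses only the standard library: it prints the encoding that open() falls back to, then reads a small sample file once as ASCII and once as UTF-8. The sample markup and the temporary file are made up for illustration, and encoding="ascii" is passed explicitly only to simulate the ASCII default locale seen in the traceback.

```python
import locale
import os
import tempfile

# The encoding open() uses when no encoding argument is given.
print("default encoding:", locale.getpreferredencoding(False))

# Hypothetical sample file: "©" is stored as the bytes 0xC2 0xA9 in UTF-8.
html = '<meta charset="utf-8"/><p>© Logs</p>'
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write(html.encode("utf-8"))
    path = tmp.name

try:
    # Simulate an ASCII default locale: decoding fails on byte 0xC2.
    try:
        with open(path, "r", encoding="ascii") as f:
            f.read()
        ascii_error = None
    except UnicodeDecodeError as exc:
        ascii_error = exc

    # An explicit encoding makes the read independent of the locale.
    with open(path, "r", encoding="utf-8") as f:
        decoded = f.read()
finally:
    os.remove(path)

print("ascii read failed:", ascii_error)
print("utf-8 read ok:", decoded == html)
```

If the first line prints an ASCII-based encoding (names such as ANSI_X3.4-1968 or US-ASCII are common), the locale is the cause, and passing encoding="utf-8" to open() avoids depending on it.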
Upvotes: 1