Reputation: 1192
I want to filter out HTML tags from a text e-mail. Unfortunately I'm facing an issue with a particular e-mail that contains a < sign. I was expecting BeautifulSoup to ignore the "tag", but it filters out everything after the < sign. I tried all the parsers.
Edit: Actually I found out that html.parser (the default parser) does the job. I just couldn't spot my typo at first. But why are the other parsers having this issue?
from bs4 import BeautifulSoup
html = r"""Hallo,
------------------ Original ------------------
From: "foo"<[email protected]>;
Subject: New message
Expected
"""
for parser in ('html5lib', 'html.parser', 'lxml'):
    soup = BeautifulSoup(html, parser)
    text = soup.get_text()
    print(parser)
    print(text)
    print("Expected" in text)
The output is as follows:
html5lib
Hallo,
------------------ Original ------------------
From: "foo"
False
html.parser
Hallo,
------------------ Original ------------------
From: "foo"<[email protected]>;
Subject: New message
Expected
True
lxml
Hallo,
------------------ Original ------------------
From: "foo"
False
Edited: I'm expecting the output to be something like
Hallo,
------------------ Original ------------------
From: "foo"<[email protected]>;
Subject: New message
Expected
Upvotes: 0
Views: 68
Reputation: 5518
You are trying to parse a string that is only partially HTML-escaped; many parsers will not handle this gracefully because it simply doesn't parse as you expect in the "language" the parser describes.
The < should actually be escaped as &lt; or else the entire rest of the email should be unescaped (and then properly escaped, e.g. with html.escape, before passing it to the parser).
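As a minimal sketch of the escaping step, assuming the whole input is known to be plain text: the sample string and its placeholder address below are hypothetical, not the message from the question, and only html.parser is used so that no extra parser package is needed.

```python
import html

from bs4 import BeautifulSoup

# Hypothetical sample; the address is a placeholder, not the one from the question.
raw = 'Hallo,\nFrom: "foo"<foo@example.com>;\nSubject: New message\nExpected\n'

# quote=False escapes only &, < and >, which is enough for text content.
escaped = html.escape(raw, quote=False)

# After escaping, the parser sees "&lt;" and "&gt;" as text, not as tag delimiters.
soup = BeautifulSoup(escaped, "html.parser")
text = soup.get_text()

print(escaped)
print("Expected" in text)
```

Note that escaping the whole string turns any real tags into literal text as well, so this only fits input that should not contain markup at all.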
That being said, it looks like the html.parser parser does handle this properly, e.g. if you fix the hardcoded parser and typo in your code above:
for parser in ('html5lib', 'html.parser', 'lxml'):
    soup = BeautifulSoup(html, parser)
    text = soup.get_text()
    print(parser)
    print(text)
    print("Expected" in text)
Outputs:
html5lib
Hallo,
------------------ Original ------------------
From: "foo"
False
html.parser
Hallo,
------------------ Original ------------------
From: "foo"<[email protected]>;
Subject: New message
Expected
True
lxml
Hallo,
------------------ Original ------------------
From: "foo"
False
Upvotes: 1