bastian

Reputation: 1192

BeautifulSoup incorrectly filters out text

I want to strip HTML tags from a text e-mail. Unfortunately, I'm running into an issue with one particular e-mail that contains a literal < sign. I expected BeautifulSoup to treat it as text rather than as the start of a tag, but it drops everything after the < sign. I tried all the parsers.

Edit: Actually, I found out that html.parser (Python's built-in parser) does the job. I just couldn't spot my typo at first. But why do the other parsers have this issue?

from bs4 import BeautifulSoup

html = r"""Hallo,
&nbsp;
------------------&nbsp;Original&nbsp;------------------
From: &nbsp;"foo"<[email protected]&gt;;
Subject: &nbsp;New message

Expected
"""
for parser in ('html5lib', 'html.parser', 'lxml'):
    soup = BeautifulSoup(html, parser)
    text = soup.get_text()
    print(parser)
    print(text)
    print("Expected" in text)

The output is as follows:

html5lib
Hallo,
 
------------------ Original ------------------
From:  "foo"
False
html.parser
Hallo,
 
------------------ Original ------------------
From:  "foo"<[email protected]>;
Subject:  New message

Expected

True
lxml
Hallo,
 
------------------ Original ------------------
From:  "foo"
False

Edited: I'm expecting the output to be something like

Hallo,
 
------------------ Original ------------------
From:  "foo"<[email protected]>;
Subject:  New message

Expected


Answers (1)

lemonhead

Reputation: 5518

You are trying to parse a string that is only partially HTML-escaped. Many parsers will not handle this gracefully, because the input simply doesn't parse the way you expect in the "language" the parser describes.
Either the < should be escaped as &lt;, or the whole e-mail should be unescaped and then escaped consistently (e.g. with html.escape) before it is passed to the parser.
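
A minimal sketch of the second option, using only the standard library. It assumes the message is plain text with no real markup you want to keep, since every < ends up being treated as literal text. The helper name normalize_entities and the address foo@example.com are placeholders for illustration, not part of your code.

```python
import html


def normalize_entities(raw: str) -> str:
    """Unescape every entity, then escape the whole string consistently."""
    # "&gt;" becomes ">", "&nbsp;" becomes the non-breaking space "\xa0"
    plain = html.unescape(raw)
    # "&" becomes "&amp;", "<" becomes "&lt;", ">" becomes "&gt;";
    # quote=False leaves the double quotes untouched
    return html.escape(plain, quote=False)


# Hypothetical sample line; foo@example.com is a placeholder address.
raw = 'From: &nbsp;"foo"<foo@example.com&gt;;\nExpected\n'
normalized = normalize_entities(raw)
print(normalized)
```

After this step the string no longer contains a bare <, so passing normalized to BeautifulSoup should give the same text with any of the three parsers. If the e-mail can also contain real tags that you want stripped, this approach is too blunt and you would have to escape only the stray < characters instead.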

That being said, it looks like html.parser does handle this properly, e.g. if you fix the hardcoded parser and the typo in your code above:

for parser in ('html5lib', 'html.parser', 'lxml'):
    soup = BeautifulSoup(html, parser)
    text = soup.get_text()
    print(parser)
    print(text)
    print("Expected" in text)

Outputs:

html5lib
Hallo,
 
------------------ Original ------------------
From:  "foo"
False
html.parser
Hallo,
 
------------------ Original ------------------
From:  "foo"<[email protected]>;
Subject:  New message

Expected

True
lxml
Hallo,
 
------------------ Original ------------------
From:  "foo"
False

