BeautifulSoup incorrectly filters out text

Question

I want to strip HTML tags from a plain-text e-mail. Unfortunately, I'm running into a problem with one particular e-mail that contains a < sign. I expect BeautifulSoup to ignore this "tag", but it drops everything after the < sign. I tried all the parsers.

Edit: Actually I found out that html.parser (the default parser) does the job. I just couldn't spot my typo at first. But why are the other parsers having this issue?

from bs4 import BeautifulSoup

html = r"""Hallo,
 
------------------ Original ------------------
From:  "foo"


The output is as follows:
html5lib
Hallo,
 
------------------ Original ------------------
From:  "foo"
False
html.parser
Hallo,
 
------------------ Original ------------------
From:  "foo";
Subject:  New message

Expected

True
lxml
Hallo,
 
------------------ Original ------------------
From:  "foo"
False

Edited: I'm expecting the output to be something like
Hallo,
 
------------------ Original ------------------
From:  "foo";
Subject:  New message

Expected

lemonhead · Accepted Answer

You are trying to parse a string that is only partially HTML-escaped. Many parsers will not handle this gracefully, because the input simply doesn't parse the way you expect in the "language" the parser describes.
The < should actually be escaped as &lt;, or else the entire rest of the e-mail should be unescaped and then properly escaped (e.g. with html.escape) before it is passed to the parser.
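
A minimal sketch of the escaping approach, assuming the e-mail body is plain text that should contain no real markup. The sample string and the address in it are made up for illustration (they are not the original message), and the standard-library HTMLParser stands in for BeautifulSoup so the snippet has no third-party dependencies; TextExtractor and extract_text are hypothetical helper names.

```python
import html
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes only; any tags the parser recognizes are dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


# Hypothetical sample: plain text containing a "<...>" that is not a real tag.
raw = 'From:  "foo" <foo@example.com>;\nSubject:  New message\n\nExpected\n'

# Escape first, so "<" and ">" become &lt; and &gt; and cannot open a tag.
escaped = html.escape(raw, quote=False)

# The parser decodes the entities again, so the text comes back intact.
text = extract_text(escaped)
print("Expected" in text)
```

The same escaped string could presumably be passed to BeautifulSoup instead. Escaping everything is only appropriate if the input contains no markup you actually want removed, since real tags would be preserved as literal text as well.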

That being said, it looks like the html.parser parser does handle this properly, e.g. if you fix the hardcoded parser and the typo in your code above:

for parser in ('html5lib', 'html.parser', 'lxml'):
    soup = BeautifulSoup(html, parser)
    text = soup.get_text()
    print(parser)
    print(text)
    print("Expected" in text)

Outputs:

html5lib
Hallo,
 
------------------ Original ------------------
From:  "foo"
False
html.parser
Hallo,
 
------------------ Original ------------------
From:  "foo";
Subject:  New message

Expected

True
lxml
Hallo,
 
------------------ Original ------------------
From:  "foo"
False
