Reputation: 2153
I've inherited a function from another developer that is supposed to check whether a body parameter of an email message is an HTML body or plaintext. If it is HTML it attaches a plain and an html version of the body to the message, and if the body is not html it only attaches a plain body.
def insertBody(self, body):
    if bool(BeautifulSoup(body, "html.parser").find()):
        b = MIMEMultipart('alternative')
        b.attach( MIMEText(html2text.html2text(body),'plain') )
        b.attach( MIMEText(body,'html') )
    else:
        b = MIMEText(body,'text')
    self._msg.attach(b)
    return
The problem is that it doesn't seem to detect when only a plain body is passed; it only works when I submit a body with <html> and <body> tags. I suspect the use of the find() function, but I'm not familiar enough with BeautifulSoup to tell. Am I on the right track?
Upvotes: 0
Reputation: 1123260
That test has three problems:
You don't need to use bool() in an if test, because the if statement already does the exact same thing.
The test is way too simplistic. As soon as the text has a < character in it, followed by text and then at any point later on by a > character, the test will pass:
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('<foo bar\n baz> spam ham', 'html.parser').find()
<foo bar="" baz=""> spam ham</foo>
Using BeautifulSoup to do a full parse is overkill; the same test can be performed much more efficiently with:
import re

if re.search(r'<[^>]+>', body):
    ...  # treat body as HTML
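To see how permissive that pattern is, here is a small sketch that runs it against a few sample strings. The names ANY_TAG and looks_like_markup are made up for this illustration; only the pattern itself comes from the snippet above.

```python
import re

# Same pattern as above: a '<', one or more non-'>' characters, then a '>'.
ANY_TAG = re.compile(r'<[^>]+>')

def looks_like_markup(body):
    """Return True if body contains anything shaped like <...>."""
    return ANY_TAG.search(body) is not None

samples = [
    '<p>Hello</p>',             # real HTML
    'Just plain text',          # no angle brackets at all
    'if a < b and b > c',       # plain text, but '< b and b >' matches
    '<foo bar\n baz> spam ham', # [^>] also matches newlines
]

for text in samples:
    print(repr(text), looks_like_markup(text))
```

The third sample is plain text that still passes, so this check is no stricter than the BeautifulSoup one; it is only cheaper.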
A regular expression could be tuned to look for HTML tags that are actually valid, like:
html = re.compile(r'<(?:html|head|body)[^>]*>', flags=re.I)
if html.search(body):
    ...  # treat body as HTML
The above detects opening <html>, <head> or <body> tags; adjust as needed to how precise you need this detection to be (there is always a trade-off between precision and false positives).
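Putting it together, a rough sketch of how the body-building logic might look with the stricter check. Several things here are assumptions, not part of the original code: the function is written standalone and returns the MIME part instead of attaching it to self._msg, strip_tags is a crude stand-in for html2text.html2text() so the example runs with only the standard library, and a \b is added to the pattern so that <header> is not mistaken for <head>. It also uses the 'plain' subtype for the text-only case, since text/plain is the standard MIME type.

```python
import re
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Opening <html>, <head> or <body> tags; \b stops <header> from matching.
HTML_TAG = re.compile(r'<(?:html|head|body)\b[^>]*>', flags=re.I)
ANY_TAG = re.compile(r'<[^>]+>')

def strip_tags(html):
    """Crude placeholder for html2text.html2text(): drop anything tag-shaped."""
    return ANY_TAG.sub('', html).strip()

def build_body(body, to_plain=strip_tags):
    """Return a multipart/alternative part for HTML, else a text/plain part."""
    if HTML_TAG.search(body):
        part = MIMEMultipart('alternative')
        part.attach(MIMEText(to_plain(body), 'plain'))
        part.attach(MIMEText(body, 'html'))
    else:
        part = MIMEText(body, 'plain')
    return part

print(build_body('1 < 2 and 3 > 2').get_content_type())
print(build_body('<html><body><p>Hi</p></body></html>').get_content_type())
```

In the original method you would pass html2text.html2text as to_plain and attach the returned part to self._msg. Note that an HTML fragment without any of those three tags, such as a lone <p>...</p>, is treated as plain text by this check; widen the tag list if that matters for your input.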
Upvotes: 1