Beautifulsoup use of find()

Question

I've inherited a function from another developer that is supposed to check whether a body parameter of an email message is an HTML body or plaintext. If it is HTML it attaches a plain and an html version of the body to the message, and if the body is not html it only attaches a plain body.

```
def insertBody(self, body):
    if bool(BeautifulSoup(body, "html.parser").find()):
        b = MIMEMultipart('alternative')
        b.attach( MIMEText(html2text.html2text(body),'plain') )
        b.attach( MIMEText(body,'html') )
    else:
        b = MIMEText(body,'text')
    self._msg.attach(b)
    return
```

The problem is that it doesn't seem to detect when only a plain body is passed; it only works when I submit a body with HTML tags. I suspect the use of the find() function, but I'm not familiar enough with BeautifulSoup to tell. Am I on the right track?

Martijn Pieters · Accepted Answer

That test has three problems:

1. You don't need to use bool() in an if test, because the if statement already does the exact same thing.
2. The test is far too simplistic. As soon as the text has a < character in it, followed by text and then at any point later on by a > character, the test will pass:
```
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('<foo> spam ham', 'html.parser').find()
<foo> spam ham</foo>
```
3. Using BeautifulSoup to do a full parse is overkill; the same test can be performed much more efficiently with:
```
import re

if re.search('<[^>]+>', body):
    # ...
```
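
To make the looseness of that check concrete, here is a minimal sketch using only the standard library. The helper name `looks_like_html` and the sample strings are made up for illustration and are not part of the original code.

```python
import re

# Loose check: any run of "<", one or more non-">" characters, then ">".
ANY_TAG = re.compile(r'<[^>]+>')


def looks_like_html(body):
    """Return True if body contains anything shaped like a tag (hypothetical helper)."""
    return ANY_TAG.search(body) is not None


print(looks_like_html("Just a plain sentence."))  # False
print(looks_like_html("<p>Hello</p>"))            # True
print(looks_like_html("if a < b then c > d"))     # True, although this is not HTML
```

The last sample shows that ordinary text containing both a < and a later > is still classified as HTML by this pattern, which is the motivation for a stricter one.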

A regular expression could be tuned to look for HTML tags that are actually valid, like:

```
import re

html = re.compile('<(?:html|head|body)[^>]*>', flags=re.I)
if html.search(body):
    # ...
```

The above detects opening <html>, <head> or <body> tags; adjust as needed to how precise you need this detection to be (there is always a trade-off between precision and false-positives).
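
As a rough sketch of how the stricter pattern could slot into the function from the question: the name `build_body` and its `to_text` parameter are hypothetical, introduced here so the example runs without third-party packages. In the question's code you would presumably pass `html2text.html2text` as `to_text`. The plain-text fallback below uses the `'plain'` subtype, which is an assumption on my part; the question's code passes `'text'`.

```python
import re
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Only treat the body as HTML if it contains an opening html, head or body tag.
HTML_TAG = re.compile(r'<(?:html|head|body)[^>]*>', flags=re.I)


def build_body(body, to_text):
    """Return a MIME part for body (hypothetical helper).

    to_text is any callable that converts HTML to plain text.
    """
    if HTML_TAG.search(body):
        part = MIMEMultipart('alternative')
        part.attach(MIMEText(to_text(body), 'plain'))
        part.attach(MIMEText(body, 'html'))
    else:
        # Assumption: 'plain' is the intended subtype for the fallback.
        part = MIMEText(body, 'plain')
    return part


def strip_tags(html):
    """Crude stand-in for a real HTML-to-text converter, for demonstration only."""
    return re.sub(r'<[^>]+>', '', html)


plain = build_body("price < 10 and size > 3", strip_tags)
rich = build_body("<html><body><p>Hi</p></body></html>", strip_tags)
print(plain.get_content_type())
print(rich.get_content_type())
```

With this pattern, stray < and > characters in ordinary text no longer trigger the HTML branch, at the cost of missing HTML fragments that lack those three tags.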
