voodoo-burger

Reputation: 2153

Beautifulsoup use of find()

I've inherited a function from another developer that is supposed to check whether the body parameter of an email message is HTML or plaintext. If it is HTML, it attaches both a plain and an HTML version of the body to the message; if it is not HTML, it attaches only a plain body.

def insertBody(self, body):
    if bool(BeautifulSoup(body, "html.parser").find()):
        b = MIMEMultipart('alternative')
        b.attach( MIMEText(html2text.html2text(body),'plain') )
        b.attach( MIMEText(body,'html') )
    else:
        b = MIMEText(body,'text')
    self._msg.attach(b)
    return

The problem is that it doesn't seem to detect when only a plain body is passed; it only works when I submit a body with <html> and <body> tags. I suspect the use of the find() function, but I'm not familiar enough with BeautifulSoup to be able to tell. Am I on the right track?

Upvotes: 0

Views: 51

Answers (1)

Martijn Pieters

Reputation: 1123260

That test has three problems:

  • You don't need to use bool() in an if test, because the if statement already does the exact same thing.

  • The test is way too simplistic. As soon as the text has a < character in it, followed by text and then at any point later on by a > character, the test will pass:

    >>> from bs4 import BeautifulSoup
    >>> BeautifulSoup('<foo bar\n baz> spam ham', 'html.parser').find()
    <foo bar="" baz=""> spam ham</foo>
    
  • Using BeautifulSoup to do a full parse is overkill; the same test can be performed much more efficiently with:

    import re
    
    if re.search('<[^>]+>', body):
        # ...
    
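
To make the second and third points concrete, here is a small sketch that runs the loose pattern over a few sample bodies. The sample strings and the names any_tag, samples and flagged are made up for illustration; only the pattern itself comes from the answer.

```python
import re

# The loose pattern from above: "<", one or more non-">" characters, then ">".
any_tag = re.compile(r'<[^>]+>')

# Hypothetical sample bodies, chosen only for illustration.
samples = [
    'Just a plain sentence.',
    'Prices: 3 < 5 and 7 > 2',          # plain text with stray angle brackets
    '<foo bar\n baz> spam ham',         # tag-shaped, but not real HTML
    '<p>An actual HTML paragraph</p>',
]

# True means the loose test would treat the body as HTML.
flagged = [bool(any_tag.search(text)) for text in samples]

for text, hit in zip(samples, flagged):
    print(repr(text), '->', 'HTML' if hit else 'plain')
```

Everything except the first sample is flagged, including the plain sentence that merely contains a < and a later >, which is the kind of false positive described above.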

A regular expression could be tuned to look for HTML tags that are actually valid, like:

html = re.compile(r'<(?:html|head|body)[^>]*>', flags=re.I)
if html.search(body):
    # ...

The above detects opening <html>, <head> or <body> tags; adjust as needed to how precise you need this detection to be (there is always a trade-off between precision and false-positives).
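
As a rough sketch of how this could be wired into the function from the question: looks_like_html and build_body below are hypothetical names, not part of the original code, and to_plain is an assumed callable (for example html2text.html2text) that converts HTML to plain text. The sketch returns the MIME part instead of attaching it to self._msg so that it stays self-contained.

```python
import re
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Matches an opening <html>, <head> or <body> tag, case-insensitively.
HTML_TAG = re.compile(r'<(?:html|head|body)[^>]*>', flags=re.I)


def looks_like_html(body):
    """Return True if body contains a tag matched by HTML_TAG."""
    return HTML_TAG.search(body) is not None


def build_body(body, to_plain=None):
    """Build the MIME part for body (hypothetical helper).

    to_plain is an assumed HTML-to-text callable; if it is omitted,
    the raw body is reused for the plain alternative.
    """
    if looks_like_html(body):
        part = MIMEMultipart('alternative')
        plain = to_plain(body) if to_plain else body
        part.attach(MIMEText(plain, 'plain'))
        part.attach(MIMEText(body, 'html'))
        return part
    # The second argument is the MIME subtype, so 'plain' gives text/plain.
    return MIMEText(body, 'plain')
```

Inside insertBody, the result could then be passed to self._msg.attach(). Note that because [^>]* accepts any characters after the tag name, this pattern also matches tags such as <header>; tighten it if that matters for your input.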

Upvotes: 1
