duc hathaway

Reputation: 467

BeautifulSoup: what's the difference between 'lxml' and 'html.parser' and 'html5lib' parsers?

When using Beautiful Soup, what is the difference between the 'lxml', 'html.parser', and 'html5lib' parsers?

When would you use one over the others, and what are the benefits of each? When I tried each of them they seemed interchangeable, but people here keep telling me I should be using a different one. I'd like to strengthen my understanding; I've read a couple of posts here about this, but they barely cover when to use each parser, if at all.

Example:

soup = BeautifulSoup(response.text, 'lxml')

Upvotes: 37

Views: 26235

Answers (2)

Vinícius Figueiredo

Reputation: 6508

From the docs' summary table of advantages and disadvantages:

  1. html.parser - BeautifulSoup(markup, "html.parser")

    • Advantages: Batteries included, Decent speed, Lenient (as of Python 2.7.3 and 3.2.)

    • Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)

  2. lxml - BeautifulSoup(markup, "lxml")

    • Advantages: Very fast, Lenient

    • Disadvantages: External C dependency

  3. html5lib - BeautifulSoup(markup, "html5lib")

    • Advantages: Extremely lenient, Parses pages the same way a web browser does, Creates valid HTML5

    • Disadvantages: Very slow, External Python dependency
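To make the "lenient" column concrete, here is a minimal sketch that feeds the same invalid snippet to each parser and prints the tree each one builds. The snippet "<a></p>" is the one used in the "Differences between parsers" section of the Beautiful Soup docs; the variable names are mine, and the code assumes bs4 is installed while lxml and html5lib may or may not be.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invalid markup: an unclosed <a> followed by a stray </p>
broken = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # lxml and html5lib are separate installs; record whichever is missing
        results[parser] = None

for parser, tree in results.items():
    shown = tree if tree is not None else "(not installed)"
    print(f"{parser:12} {shown}")
```

According to the docs, html.parser keeps just the <a> tag and drops the stray </p>, lxml additionally wraps the result in <html><body>, and html5lib builds a full document with <head> and turns the stray </p> into an empty <p></p> pair, as a browser would. Exact output can vary between library versions, so treat this as a way to inspect the differences rather than a guarantee.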

Upvotes: 48

alecxe

Reputation: 473763

The key differences are highlighted in the BeautifulSoup documentation:

The basic reasons to prefer one parser over the others:

  • html.parser - built-in - no extra dependencies needed
  • html5lib - the most lenient - the better choice if the HTML is broken
  • lxml - the fastest
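One way to act on these trade-offs is to ask for the preferred parser and fall back to the built-in one when it is not installed. This is only a sketch: make_soup and its lenient flag are hypothetical names of my own, not part of Beautiful Soup. It relies on bs4 raising FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, lenient=False):
    """Hypothetical helper: return (soup, parser_name) for the first usable parser.

    lenient=True tries html5lib first (for badly broken HTML);
    otherwise lxml is tried first (for speed).
    html.parser is always last because it ships with Python.
    """
    order = ("html5lib", "lxml", "html.parser") if lenient else ("lxml", "html.parser")
    for parser in order:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # Requested parser is not installed; try the next one
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>Hello, <b>world</b></p>")
print(used, soup.b.get_text())
```

Keep in mind that silently switching parsers can change the resulting tree on invalid markup, so for reproducible scraping it is usually safer to pin one parser explicitly once you have chosen it.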

Upvotes: 18
