Do HTML5-compliant parsers process HTML 4 and older correctly?

Question

The Wikipedia article on tag soup (https://en.wikipedia.org/wiki/Tag_soup#HTML5) says:

HTML5 aims to be the most complete solution to the problem of tag soup thus far while remaining as backwards- and forwards-compatible as possible. By contrast to XHTML, which departs from backwards compatibility and takes the approach that parsers should become less tolerant of badly formed markup, HTML5 acknowledges that badly formed HTML code already exists in large quantities and will probably continue to be used, and takes the view that the specification should be expanded to ensure maximum compatibility with such code.

Thus, the HTML 5 specification has altered its definition of HTML syntax both to accommodate common syntax in use today, and to explicitly describe exactly how "badly formed code" should be treated by the parser. The handling of badly formed code now has a place in the specification itself, hopefully reducing the need for future HTML parsers to implement additional, out-of-specification measures for dealing with code that it does not recognize.

Do I understand correctly, then, that an HTML5 parser should parse older HTML pages (such as HTML 2.0 or HTML 4) correctly? I need an HTML parser that can handle most pages on the internet. I found Google Gumbo: https://github.com/google/gumbo-parser. Its description says it is an HTML5 parser. Will it also be suitable for parsing web pages that are not HTML5?

Stefan Haustein · Accepted Answer

Yes, that's one of the main differences between HTML5 and XHTML. You should be able to parse any HTML page with an HTML5 parser.
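
As a rough illustration (not part of the original answer), the sketch below feeds an HTML 4.01-style page to an HTML5 parser and reads back the resulting tree. It uses html5lib, a Python HTML5 parser that has to be installed separately, purely as a stand-in for Gumbo. The sample page and the `summarize` helper are made up for this example. The page uses habits common in older markup: uppercase tags, unclosed `<P>` elements, and unquoted attribute values.

```python
import html5lib  # third-party HTML5 parser (assumed installed: pip install html5lib)

# Hypothetical HTML 4.01 Transitional page written in typical pre-HTML5 style.
OLD_PAGE = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD><TITLE>Old page</TITLE></HEAD>
<BODY BGCOLOR=white>
<P>First paragraph
<P>Second paragraph with a <A HREF=page.html>link</A>
</BODY>
</HTML>"""


def summarize(markup):
    """Parse markup with an HTML5 parser and pull out a few facts from the tree."""
    # namespaceHTMLElements=False keeps tag names plain ("p" instead of "{ns}p").
    root = html5lib.parse(markup, treebuilder="etree", namespaceHTMLElements=False)
    body = root.find("body")
    return {
        "title": root.findtext("head/title"),
        # Each unclosed <P> is ended implicitly when the next <P> starts.
        "paragraphs": ["".join(p.itertext()).strip() for p in body.findall("p")],
        # Tag and attribute names are lowercased by the parser.
        "links": [a.get("href") for a in body.iter("a")],
    }


if __name__ == "__main__":
    print(summarize(OLD_PAGE))
```

The old doctype does not select a different parser here. The same parsing algorithm runs on the whole page, so the two paragraphs and the link come out as separate, properly nested elements. Gumbo is described as following the same HTML5 parsing algorithm, so a similar check against its C API should be expected to behave the same way, though that is an assumption this sketch does not verify.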
