Reputation: 2792
I'm looking for a solution for parsing potentially malformed HTML in C++, similar to what Beautiful Soup does in Python.
Normally, just using an XML parser would work, but the specific HTML in this case isn't valid XML/XHTML and can't be properly parsed.
Do libraries/tools for this exist?
Upvotes: 4
Views: 746
Reputation: 9140
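According to the documentation, libxml2 is capable of parsing HTML 4. It ships a separate HTML parser module (`HTMLparser`) that tolerates markup which is not well-formed XML.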
Upvotes: 2
Reputation: 748
You can use HTMLTidy to transform the HTML into valid XML (XHTML) and then use any available C++ XML parser.
Upvotes: 6
Reputation: 1545
I've used Xerces and recommend it for C++. It supports both the DOM and SAX models.
Upvotes: -1