Robin Rodricks

Reputation: 113976

Parse malformed XML

I'm trying to load a piece of (possibly) malformed HTML into an XmlDocument object, but it fails with an XmlException because there are extra opening/closing tags and unclosed tags such as <img > instead of <img />.

How do I get the XML to parse with all the errors in the data? Is there any XML validator that I can apply before parsing, to correct these errors? Or would handling the exception parse whatever can be parsed?

Upvotes: 7

Views: 6472

Answers (6)

Mitchel Sellers

Reputation: 63126

Depending on your specific needs, you might be able to use HTML Tidy to clean up the document and then load it into an XmlDocument object.

Upvotes: 1

Mitch Wheat

Reputation: 300559

You can't load malformed XML into an XmlDocument.

Check out the HTML Agility Pack on CodePlex.

Upvotes: 0

LBushkin

Reputation: 131676

It's unlikely that you will be able to build an XmlDocument from content with this level of malformed structure. XmlDocument (to my knowledge) requires that XML content adhere to proper nesting and closure syntax.

However, I suspect that you could parse this with an XmlReader instead. It may still throw exceptions if certain egregious errors are encountered, but according to the MSDN docs, it can at least disclose the location of the errors.

If you're just dealing with HTML, there is the HTML Agility Pack, which may serve your purposes.
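The idea of letting the parser report where the markup breaks can be sketched in a few lines. This is only an illustration in Python using its standard-library `xml.etree.ElementTree`, not the .NET XmlReader discussed above; the helper name `first_xml_error` is hypothetical. In .NET, the comparable information is exposed through the `LineNumber` and `LinePosition` properties of `XmlException`.

```python
import xml.etree.ElementTree as ET


def first_xml_error(markup):
    """Return (line, column, message) for the first well-formedness
    error the strict parser hits, or None if the markup parses."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple.
        line, column = err.position
        return line, column, str(err)
    return None


if __name__ == "__main__":
    # <img > is never closed, so the parser fails at the closing </root>.
    broken = "<root>\n  <img >\n</root>"
    print(first_xml_error(broken))
    print(first_xml_error("<root>\n  <img />\n</root>"))
```

Note that a strict parser stops at the first error, so this tells you where the first problem is rather than recovering the rest of the document.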

Upvotes: 1

annakata

Reputation: 75824

You might want to check out the answer to this question.

Basically, somewhere between a .NET port of Beautiful Soup and the HTML Agility Pack there is a way.

Upvotes: 2

Marc Gravell

Reputation: 1062855

The HTML Agility Pack will parse HTML, rather than XHTML, and is quite forgiving. The object model will be familiar if you've used XmlDocument.

Upvotes: 15

Brian Genisio

Reputation: 48137

What you are trying to do is very difficult. HTML cannot be parsed using an XML parser since XML is strict and HTML is not. If that HTML were compliant XHTML (HTML as XML), then an XML parser would parse the HTML without issue.

You might want to see if there are any HTML to XHTML converters out there, if you really want to use an XML parser for HTML.

In other words, I have yet to meet an XML parser that handles malformed XML... they are not designed to accept loose markup like HTML (for good reason, too :) )
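The strict-versus-lenient distinction is easy to see in a small experiment. The sketch below uses Python's standard library purely as a language-neutral illustration (it is not .NET code and not the HTML Agility Pack): `xml.etree.ElementTree` stands in for a strict XML parser and `html.parser.HTMLParser` for a forgiving HTML parser. The names `strict_parse_ok`, `TagCollector`, and `lenient_tags` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# HTML-style markup: <img > is never closed, which is not well-formed XML.
SNIPPET = '<div><p>Hello<img src="a.png" ></p></div>'


def strict_parse_ok(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Lenient HTML parser that records every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def lenient_tags(markup):
    """Feed markup to the lenient parser and return the start tags found."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print(strict_parse_ok(SNIPPET))  # strict XML parser rejects the snippet
    print(lenient_tags(SNIPPET))     # lenient HTML parser still sees the tags
```

The same snippet becomes acceptable to the strict parser once the tag is written as `<img src="a.png" />`, which is essentially what an HTML-to-XHTML converter does for you.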

Upvotes: 0
