Reputation: 201
I need to parse invalid HTML files that contain stray elements (such as BODY) on random lines throughout the file. I tried to parse the file as XML, but with no luck, since its XML structure is invalid as well (many incorrect attributes on random elements throughout the file). HtmlAgilityPack also failed to read the file: it reads everything up to the first incorrect element and nothing after it.
Here is a small example of such a file:
<HTML>
<HEAD>
<TITLE>My title</TITLE>
</HEAD>
<BODY leftmargin=9 topmargin=7 >
<TABLE>
<TR>
<TD>Test</TD>
</TR>
<TR>
<TD>Test</TD>
<TD>Test<TD>
</TR>
<BODY> <-- This is the point where HtmlAgilityPack is stuck --!>
<TR>
<TD>Test</TD>
<TD>Test</TD>
</TR>
<TR>
</BODY>
<TR>
<TD><FONT>Test</FONT></TD>
</TR>
</TABLE>
</BODY>
I'm trying to extract the data from that table.
Upvotes: 6
Views: 2111
Reputation: 17272
Let Internet Explorer do the hard work for you: it will do its best to "repair" the broken tag structure into something it understands (a well-formed tree with correctly paired tags).
Open the HTML in a WebBrowser control (System.Windows.Forms.WebBrowser, or System.Windows.Controls.WebBrowser if you prefer WPF); you can then walk the DOM through its Document property. The DOM will always be well-formed, no matter how broken the original source was.
No third-party libraries are needed.
Upvotes: 4
Reputation: 7574
We have parsed web pages with invalid HTML using Html Agility Pack. As I remember, it did a pretty good job.
Upvotes: 3
Reputation: 2767
You can use SgmlReader. Of course, if your HTML files are badly malformed, it won't help you.
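If a tree-building parser gives up on input like the sample in the question, one possible fallback is to skip tree building entirely and only react to TR and TD tag events, ignoring everything else (including stray BODY tags). The sketch below is not SgmlReader or Html Agility Pack; it uses Python's standard-library `html.parser` as a stand-in to illustrate the idea. The names `CellCollector`, `extract_rows`, and the `BROKEN` sample are made up for this illustration, and the sketch assumes the table is not nested inside another table.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect TD text row by row, ignoring every tag except TR, TD and TABLE."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the open cell, or None

    def _close_cell(self):
        # Finish the open cell (if any) and attach it to the current row.
        if self._cell is not None:
            if self._row is None:
                self._row = []
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _close_row(self):
        # Finish the open row (if any); rows without cells are dropped.
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased. A new TR or TD implicitly closes
        # the previous one, so missing end tags do not matter.
        if tag == "tr":
            self._close_row()
            self._row = []
        elif tag == "td":
            self._close_cell()
            self._cell = []
        # Anything else (body, font, ...) is deliberately ignored.

    def handle_endtag(self, tag):
        if tag == "td":
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        # Keep text only while a cell is open.
        if self._cell is not None:
            self._cell.append(data)


def extract_rows(html):
    """Return the table cells of `html` as a list of rows."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    parser._close_row()  # flush a row left open at end of input
    return parser.rows


# Made-up sample with the same kind of damage as the file in the question.
BROKEN = """<HTML><BODY leftmargin=9 topmargin=7 >
<TABLE>
<TR><TD>A1</TD><TD>A2</TD></TR>
<BODY>
<TR><TD>B1</TD><TD>B2</TD></TR>
<TR>
</BODY>
<TR><TD><FONT>C1</FONT></TD></TR>
</TABLE>
</BODY>"""

if __name__ == "__main__":
    for row in extract_rows(BROKEN):
        print(row)
```

The trade-off is that this approach recovers only the cell text and its row grouping; it does not give you a document tree, so anything that depends on nesting (for example, nested tables) would need extra handling.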
Upvotes: 0