Jcf

Reputation: 201

Is there any way to parse invalid HTML?

I need to parse invalid HTML files that contain stray elements (such as BODY) at arbitrary places throughout the file. I tried parsing the file as XML, but with no luck, since it is not well-formed XML either (many elements have incorrect attributes). HtmlAgilityPack also failed to read the file: it reads only up to the first incorrect element and nothing after it.

Here is a small example of such a file:

<HTML>
<HEAD>
    <TITLE>My title</TITLE>
</HEAD>
<BODY leftmargin=9 topmargin=7 >
    <TABLE>
        <TR>
            <TD>Test</TD>
        </TR>
        <TR>
            <TD>Test</TD>
            <TD>Test<TD>
        </TR>
            <BODY> <-- This is the point where HtmlAgilityPack is stuck --!>
                <TR>
                    <TD>Test</TD>
                    <TD>Test</TD>
                </TR>
                <TR>
            </BODY>
        <TR>
        <TD><FONT>Test</FONT></TD>
        </TR>
    </TABLE>
</BODY>

I'm trying to extract the data from that table.

Upvotes: 6

Views: 2111

Answers (3)

Matěj Zábský

Reputation: 17272

Let Internet Explorer do the hard work for you: it will do its best to "repair" the broken tag structure into something it understands (a well-formed element tree with correctly paired tags).

Load the HTML into a WebBrowser control (System.Windows.Forms.WebBrowser, or System.Windows.Controls.WebBrowser if you prefer the WPF libraries), then walk the DOM through its Document property. The DOM will always be a consistent tree, no matter how broken the original source was.

No third-party libraries needed.

Upvotes: 4

Eugeniu Torica

Reputation: 7574

We have parsed web pages containing invalid HTML with Html Agility Pack. As far as I remember, it did a pretty good job.
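
For illustration, here is a minimal sketch of how that might look for a table like the one in the question. It assumes the HtmlAgilityPack NuGet package is referenced; the `html` string is a shortened, made-up stand-in for the real file, and the `//td` query is just one possible way to get at the cells. The sketch selects every TD anywhere in the document instead of walking the TABLE/TR nesting, because the stray BODY tags may disturb whatever nesting the parser builds. It also prints `ParseErrors`, which can help show where the parser objected to the input.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical, shortened stand-in for the broken file from the question.
        string html = @"<HTML><BODY leftmargin=9 topmargin=7>
<TABLE>
  <TR><TD>A</TD><TD>B<TD></TR>
  <BODY>
    <TR><TD>C</TD></TR>
  </BODY>
  <TR><TD><FONT>D</FONT></TD></TR>
</TABLE>
</BODY>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Report what the parser considered malformed instead of failing silently.
        foreach (HtmlParseError error in doc.ParseErrors)
        {
            Console.WriteLine("Line {0}: {1}", error.Line, error.Reason);
        }

        // Flat query for all cells, independent of how they ended up nested.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes("//td");
        if (cells != null)
        {
            foreach (HtmlNode cell in cells)
            {
                Console.WriteLine(cell.InnerText.Trim());
            }
        }
    }
}
```

If some cells are still missing with a flat query like this, the parse errors are probably the first place to look.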

Upvotes: 3

Łukasz Wiatrak

Reputation: 2767

You can use SgmlReader. Of course, if your HTML files are very badly malformed, it won't help you.
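
A rough sketch of the usual SgmlReader pattern, assuming the SgmlReader library (namespace `Sgml`) is referenced. The `html` string is a made-up sample, and whether the result is usable depends on how broken the real input is. The idea is that SgmlReader acts as an `XmlReader` over the HTML, so the output can be loaded into a regular `XmlDocument` and queried with XPath.

```csharp
using System;
using System.IO;
using System.Xml;
using Sgml;

class Program
{
    static void Main()
    {
        // Hypothetical sample input; replace with the contents of the real file.
        string html = @"<HTML><BODY leftmargin=9>
<TABLE>
  <TR><TD>A</TD><TD>B<TD></TR>
  <TR><TD><FONT>C</FONT></TD></TR>
</TABLE>
</BODY>";

        var sgml = new SgmlReader
        {
            DocType = "HTML",                   // use the built-in HTML DTD
            CaseFolding = CaseFolding.ToLower,  // normalize TD/td so one XPath matches
            InputStream = new StringReader(html)
        };

        // SgmlReader is an XmlReader, so XmlDocument can consume it directly.
        var xml = new XmlDocument();
        xml.Load(sgml);

        foreach (XmlNode cell in xml.SelectNodes("//td"))
        {
            Console.WriteLine(cell.InnerText.Trim());
        }
    }
}
```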

Upvotes: 0
