broke

Reputation: 8302

Dealing with malformed HTML using HTML Agility Pack

I'm trying to scrape an HTML table full of data on a website. Unfortunately, the source code for the table looks like this:

<table border="1" cellspacing="0" cellpadding="3">

<tr>
<td bgcolor="silver"><font face="arial,helvetica" size="1">Last Name</font></td>

<td bgcolor="silver"><font face="arial,helvetica" size="1">First Name</font></td>

<td bgcolor="silver"><font face="arial,helvetica" size="1">Middle</font></td>
</tr>

<td valign="top"><font face="arial,helvetica" size="1">
Data</font></td>

<td valign="top"><font face="arial,helvetica" size="1">
Data</font></td>

<td valign="top"><font face="arial,helvetica" size="1">
Data</font></td>
</tr>   

<td valign="top"><font face="arial,helvetica" size="1">
More Data</font></td>

<td valign="top"><font face="arial,helvetica" size="1">
More Data</font></td>

<td valign="top"><font face="arial,helvetica" size="1">
More Data</font></td>
</tr>
</table>

Note the lack of starting "tr" tags for each row after the header. The table renders fine in a browser, but the HTML Agility Pack does not recognize the tr elements that have no start tag. Is there any way I can get the HTML Agility Pack to fix this? I'd rather not insert the tr tags myself, but I will if I have to.

Upvotes: 1

Views: 983

Answers (1)

L.B

Reputation: 116098

You can ignore the tr tags entirely: select all the td elements and group them into rows of 3 cells each.

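// Take every <td> in document order, regardless of missing <tr> tags,
// and start a new row every 3 cells. The first group is the header row.
var list = doc.DocumentNode.Descendants("td")
            .Select((td, i) => new { td, i })
            .GroupBy(x => x.i / 3)
            .Select(g => g.Select(t => t.td.InnerText.Trim()).ToList())
            .ToList();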
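
If you would rather not hard-code the 3, here is a rough sketch of the same idea that reads the column count from the header row. It assumes the header tr is well-formed, as it is in the question, and that every row has the same number of cells. The shortened html string and the names columns and rows are made up for illustration, and I have not run this against the real page.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Shortened stand-in for the table in the question (hypothetical sample data).
        var html = @"<table>
<tr><td>Last Name</td><td>First Name</td><td>Middle</td></tr>
<td>Data</td><td>Data</td><td>Data</td></tr>
<td>More Data</td><td>More Data</td><td>More Data</td></tr>
</table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: the first <tr> is the intact header row,
        // so its cell count is the number of columns.
        var headerRow = doc.DocumentNode.SelectSingleNode("//tr");
        int columns = headerRow?.Elements("td").Count() ?? 0;
        if (columns == 0)
        {
            columns = 3; // fall back to the known width from the question
        }

        var rows = doc.DocumentNode.Descendants("td")
            .Select((td, i) => new { td, i })
            .GroupBy(x => x.i / columns)   // one group per logical row
            .Skip(1)                       // drop the header group
            .Select(g => g.Select(x => HtmlEntity.DeEntitize(x.td.InnerText).Trim()).ToList())
            .ToList();

        foreach (var row in rows)
        {
            Console.WriteLine(string.Join(" | ", row));
        }
    }
}
```

This only works while the cells arrive in document order with a fixed count per row. If some rows can be short or use colspan, the index-based grouping will drift and you would need to repair the markup instead.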

Upvotes: 2
