pivutali

Reputation: 161

How to sanitize HTML with HtmlAgilityPack?

I'm facing a problem in my web scraper: I need to get the decimal number inside the cell with class team_a_col home:

<th>Med. goal subiti p/p</th>
<td class='team_a_col total'>0.76</td>
<td class='team_a_col home'>0.89
<td class='team_a_col away'>0.62</td></td>

So the result should be: 0.89

But as you can see, the HTML is badly structured, so instead of getting only 0.89 I also get the content of team_a_col away with this code:

node.SelectSingleNode(".//td[@class='team_a_col home']").InnerText.Trim();

How can I get only 0.89? The closing </td> should come before the <td class='team_a_col away'> cell.

Upvotes: 1

Views: 1742

Answers (2)

Tim Schmelter

Reputation: 460238

You should set HtmlDocument.OptionFixNestedTags to true:

string html = "<th>Med. goal subiti p/p</th><td class='team_a_col total'>0.76</td><td class='team_a_col home'>0.89<td class='team_a_col away'>0.62</td></td>";

var doc = new HtmlAgilityPack.HtmlDocument
{
    OptionFixNestedTags = true,
    OptionCheckSyntax = true,
    OptionAutoCloseOnEnd = true
};
doc.LoadHtml(html);

string tdText = doc.DocumentNode.SelectSingleNode(".//td[@class='team_a_col home']")?.InnerText.Trim();

With OptionFixNestedTags set to true, the result is: 0.89

Upvotes: 3

john.kernel

Reputation: 360

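Could you just take the whole line, split it into substrings, and fetch the data from there?

// "//htmlelement/htmlelement" is a placeholder; replace it with the XPath of the row you need.
var node = doc.DocumentNode.SelectNodes("//htmlelement/htmlelement");

// Split the raw markup of the first match on spaces, then pick the piece that holds the value.
string[] nodeArray = node[0].OuterHtml.Split(' ');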
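
The substring idea could be sketched roughly as follows, using only plain .NET string methods on the raw markup. The class name SubstringSketch and the helper TextAfterOpeningTag are hypothetical names made up for this illustration. The sketch assumes the class attribute appears in the markup exactly as written in the question, so treat it as a starting point rather than a robust parser.

```csharp
using System;

public static class SubstringSketch
{
    // Hypothetical helper: returns the text that sits between the end of the
    // opening tag containing `marker` and the next '<' in the markup.
    public static string TextAfterOpeningTag(string html, string marker)
    {
        int start = html.IndexOf(marker, StringComparison.Ordinal);
        if (start < 0) return null;               // marker not present

        int tagEnd = html.IndexOf('>', start);    // end of the opening tag
        if (tagEnd < 0) return null;

        int nextTag = html.IndexOf('<', tagEnd + 1);
        if (nextTag < 0) nextTag = html.Length;   // no further tag: read to the end

        return html.Substring(tagEnd + 1, nextTag - tagEnd - 1).Trim();
    }

    public static void Main()
    {
        // Markup copied from the question, including the misplaced </td>.
        string html = "<td class='team_a_col home'>0.89\n" +
                      "<td class='team_a_col away'>0.62</td></td>";

        Console.WriteLine(TextAfterOpeningTag(html, "class='team_a_col home'"));
    }
}
```

Because the helper stops at the first '<' after the opening tag, the nested away cell is never read, so the misplaced closing tag has no effect. The trade-off is that it breaks as soon as the attribute quoting or ordering changes.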

Upvotes: 0
