Reputation: 792
HtmlAgilityPack for C# adds a tbody element to tables in the DOM tree after LoadHtml, even if it doesn't exist in the original HTML document. How can I disable this?
My algorithm builds XPath expressions by traversing the DOM tree, and the tbody element that doesn't exist in the original document makes SelectNodes fail to find the desired items. It took me a lot of time to figure this out :|
Alternatively, is it possible to make SelectNodes also consider nodes added by HtmlAgilityPack?
Example:
<table>
<tr><td>data</td></tr>
</table>
My application would produce this XPath to extract 'data': //table/tbody/tr/td
The tbody step ends up in the expression because it is in the DOM tree after HtmlAgilityPack parses the HTML; HtmlAgilityPack added it even though it doesn't exist in the source. Because of that,
doc.DocumentNode.SelectNodes("//table/tbody/tr/td");
would fail.
In other words, the parent TagName of the tr element (HtmlElement) is 'TBODY', not 'TABLE'. I'm also parsing many different web sites, so this is just one of the situations I run into.
It looks like SelectNodes searches the original HTML rather than the DOM tree that exists after HtmlDocument.LoadHtml, or that it doesn't consider the 'virtual' elements added during parsing.
Upvotes: 2
Views: 1854
Reputation: 5767
You don't have to use the full hierarchy.
Just use the following if all you want are the tds:
doc.DocumentNode.SelectNodes("//table//td");
or just ignore the tbody node and get all the hierarchy you care about:
doc.DocumentNode.SelectNodes("//table//tr/td");
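To see why the descendant step helps, here is a minimal sketch that runs the same expression against the table from the question and against a variant that does contain a tbody. The class name TbodyDemo and the helper SelectText are made up for this illustration; only HtmlDocument, LoadHtml, SelectNodes and InnerText come from HtmlAgilityPack. It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

public static class TbodyDemo
{
    // Hypothetical helper: returns the trimmed text of every node matched
    // by the XPath, or an empty list when nothing matches.
    public static List<string> SelectText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when there is no match.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return new List<string>();
        }
        return nodes.Select(n => n.InnerText.Trim()).ToList();
    }

    public static void Main()
    {
        const string withoutTbody = "<table><tr><td>data</td></tr></table>";
        const string withTbody = "<table><tbody><tr><td>data</td></tr></tbody></table>";

        // "//" between table and tr matches tr at any depth below the table,
        // so the expression does not depend on whether a tbody is present.
        const string xpath = "//table//tr/td";

        Console.WriteLine(string.Join(",", SelectText(withoutTbody, xpath)));
        Console.WriteLine(string.Join(",", SelectText(withTbody, xpath)));
    }
}
```

Both calls should find the single cell. One caveat with this approach: if tables are nested, //table//td also picks up cells of the inner tables, so generated expressions may need a more specific step when that matters.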
Upvotes: 1