Reputation: 7987
I have a piece of HTML that I'm trying to parse using HtmlAgilityPack. Here's the part of the markup I'm interested in (sorry for using a picture, but it's cleaner and shows my point more clearly):
What I'm trying to do is very simple, but I can't figure it out. I want to select the div with id = content that is highlighted in the image. To do this with HtmlAgilityPack in C# I'm using:
HtmlDocument doc = new HtmlDocument(); //creating HtmlAgilityPack document
doc.LoadHtml(htmlstring); //loading html
var content = doc.DocumentNode.SelectSingleNode("//div[@id='content']"); //running XPATH
The problem is that the last instruction selects the div I mention above, but it's incomplete. Instead of containing all the children shown in the image, it only contains one child: the first div with id = item.
The same XPath expression, when run in Chrome with XPath Helper, selects the correct div with all its children.
I don't understand whether I'm using HtmlAgilityPack incorrectly or my XPath expression is wrong. Can anyone give a hint?
Upvotes: 0
Views: 351
Reputation: 101748
Well, you've got some messed up HTML to deal with there. Every one of those items contains two malformed <a> tags.
One is missing its > at the end of its start tag:
<div id="covershot"><a href="http://www.cineblog01.tv/the-thirteenth-tale-subita-2013/" target="_self" <p><img src="http://www.locandinebest.net/imgk/The_Thirteenth_Tale_2013.jpg"></p>
and the other stops dead after <a class=" and has no closing tag:
<td><div><a class="<div class="fblike_button" style="margin: 10px 0;"><iframe src="http://www.facebook.com/plugins/like.php?href=http%3A%2F%2Fwww.cineblog01.tv%2Fthe-thirteenth-tale-subita-2013%2F&layout=button_count&show_faces=false&width=150&action=like&colorscheme=dark" scrolling="no" frameborder="0" allowTransparency="true" style="border:none; overflow:hidden; width:150px; height:20px"></iframe></div> </div> </td>
I'm guessing that's causing some problems for the parser. Have you tried selecting the wrapper or contentwrapper divs to see if it's putting the missing divs inside them?
You might try to fix these problems with some string replacement to see if that gets it to parse correctly:
htmlstring = htmlstring.Replace("target=\"_self\" <", "target=\"_self\" ><")
.Replace("<a class=\"<", "<");
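If you want to reuse that cleanup, it could be wrapped in a small helper and applied before calling LoadHtml. This is just a sketch of the same two replacements; HtmlRepair and FixMalformedAnchors are hypothetical names, and the patterns only match the exact breakage seen in the snippets above, so other pages may need different fixes.

```csharp
using System;

public static class HtmlRepair
{
    // Applies the two string replacements described above:
    // 1. adds the missing '>' after target="_self" in the broken start tag,
    // 2. drops the truncated <a class=" fragment that is never completed.
    public static string FixMalformedAnchors(string html)
    {
        return html
            .Replace("target=\"_self\" <", "target=\"_self\" ><")
            .Replace("<a class=\"<", "<");
    }
}
```

Usage would then be doc.LoadHtml(HtmlRepair.FixMalformedAnchors(htmlstring)); followed by the same SelectSingleNode call from the question.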
Upvotes: 1