Vivek Raj

Reputation: 456

Html Agility Pack parsing error

HTML

<html>
<head>
<title>Sample Page</title>
</head>
<body>
<form action="demo_form.asp" id="form1" method="get">
  First name: <input type="text" name="fname"><br>
  Last name: <input type="text" name="lname"><br>
  <input type="submit" value="Submit">
</form>
</body>
</html>

Code

// Requires: using System.IO; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(File.ReadAllText(@"C:\sample.html"));
HtmlNode nd = doc.DocumentNode.SelectSingleNode("//form[@id='form1']");
// nd.InnerHtml is "".
// nd.InnerText is "".

Problem

nd.ChildNodes                 // the collection that should hold all nodes in the form; always null
nd.SelectNodes("/input")      // returns null
nd.SelectNodes("./input")     // returns null
"//form[@id='form1']/input"   // returns null

What I want is to access the child nodes of the form tag with id=form1 one by one, in order of occurrence. I tried the same XPath in the Chrome developer console and it works exactly the way I want. Does Html Agility Pack have a problem reading HTML from a file or the web?

Upvotes: 1

Views: 753

Answers (2)

Yves Schelpe

Reputation: 3463

Try adding the following statement before loading the document:

HtmlNode.ElementsFlags.Remove("form");

By default, HtmlAgilityPack adds a form's inner elements as siblings of the form instead of as its children. The statement above changes that behaviour so that the input tags appear as child nodes of the form. Note that ElementsFlags is static, so the change applies to every document parsed afterwards.

Your code would look like this:

HtmlNode.ElementsFlags.Remove("form");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(File.ReadAllText(@"C:\sample.html"));
HtmlNode nd = doc.DocumentNode.SelectSingleNode("//form[@id='form1']");
// ...the rest of your code
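
To then walk the form's children one by one in document order, something along these lines should work. This is only a minimal sketch: the inline HTML string is a shortened stand-in for the sample page (an assumption, so the snippet runs without C:\sample.html), and the class and variable names are illustrative.

```csharp
using System;
using HtmlAgilityPack;

class FormChildrenDemo
{
    static void Main()
    {
        // Must run before the document is parsed; the setting is static.
        HtmlNode.ElementsFlags.Remove("form");

        // Assumption: shortened stand-in for the question's sample page.
        string html =
            "<form action='demo_form.asp' id='form1' method='get'>" +
            "First name: <input type='text' name='fname'><br>" +
            "Last name: <input type='text' name='lname'><br>" +
            "<input type='submit' value='Submit'>" +
            "</form>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode form = doc.DocumentNode.SelectSingleNode("//form[@id='form1']");

        // ChildNodes also contains text nodes ("First name: "), so skip
        // everything that is not an element.
        foreach (HtmlNode child in form.ChildNodes)
        {
            if (child.NodeType != HtmlNodeType.Element)
                continue;

            Console.WriteLine(child.Name + " name=" +
                child.GetAttributeValue("name", "(none)"));
        }
    }
}
```

If you only need the input elements, form.SelectNodes("./input") should return them in the same order once the flag has been removed; remember that SelectNodes returns null rather than an empty collection when nothing matches.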

References:

  1. Bug report and fix: http://htmlagilitypack.codeplex.com/workitem/23074
  2. CodePlex forum post: http://htmlagilitypack.codeplex.com/discussions/247206

Upvotes: 0

John Lay

Reputation: 301

Your HTML is invalid, which may be preventing Html Agility Pack from working properly.

Try adding a doctype (and an XML namespace) to the start of your document, and change the end of each input element from > to /> so that it is self-closing.

Upvotes: 1
