Reputation: 921

Parse (extract) content from an HTML page using .NET

I need to parse/extract information from an HTML page. Basically, I load the page as a string using System.Net.WebClient and use HTML Agility Pack to get the content inside HTML tags (forms, labels, inputs and so on).

However, some of the content is inside a JavaScript script tag, like this:

<script type="text/javascript">
//<![CDATA[
var itemCol = new Array();

itemCol[0] = {
    pid: "01010101",
    Desc: "Some desc",
    avail: "Available",
    price: "$10.00"
};

itemCol[1] = {
    pid: "01010101",
    Desc: "Some desc",
    avail: "Available",
    price: "$10.00"
};

//]]>
</script>

So, how could I parse it to a collection in .NET? Can HTML Agility Pack help with that? I really appreciate any help.

Thanks in advance.

Upvotes: 0

Answers (3)

Oded

Reputation: 499302

HTML Agility Pack (HAP) will not parse the JavaScript for you; the best it can do is give you the raw contents of the script element.

Javascript.NET may fit the bill.

Upvotes: 1

Kunal Ranglani

Reputation: 408

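Using the Javascript.NET library, you can run the script and get the collection back:

    // Requires the Noesis.Javascript, System.Text and System.Xml.XPath namespaces.
    // scriptTags is an XPathNodeIterator over the page's script elements.
    using (JavascriptContext context = new JavascriptContext())
    {
        // Expose a .NET object to the script under the name "data".
        context.SetParameter("data", new MyObject());

        // Concatenate the source of every script tag.
        StringBuilder s = new StringBuilder();
        foreach (XPathNavigator nav in scriptTags)
        {
            s.Append(nav.InnerXml);
        }

        // Copy the script's array onto the .NET object, then run everything.
        s.Append(";data.item = itemCol;");
        context.Run(s.ToString());

        MyObject o = context.GetParameter("data") as MyObject;
    }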

Then define a data structure such as:

   class MyObject
   {
     public object item { get; set; }
   }

Upvotes: 1

Kunal Ranglani

Reputation: 408

Which part of the content inside the script tag do you want, and what kind of collection are you expecting? You can always select the script tags like this:

  HtmlDocument document = new HtmlDocument();
  // Assuming downloadedHtml holds the page markup as a string, use LoadHtml
  // (Load expects a file path or a stream).
  document.LoadHtml(downloadedHtml);
  XPathNavigator n = document.CreateNavigator();
  XPathNodeIterator scriptTags = n.Select("//script");

  foreach (XPathNavigator nav in scriptTags)
  {
    string innerXml = nav.InnerXml;

    // Parse the script text using a regex
  }
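
The regex step could look roughly like the sketch below. It assumes the script always has the shape shown in the question: `itemCol[n] = { key: "value", ... };` with plain double-quoted values and no nested braces or escaped quotes. The `Item` and `ItemColParser` names are made up for illustration; only `System.Text.RegularExpressions` is used.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical type; its properties mirror the keys in the question's script.
public class Item
{
    public string Pid { get; set; }
    public string Desc { get; set; }
    public string Avail { get; set; }
    public string Price { get; set; }
}

public static class ItemColParser
{
    // Matches one "itemCol[n] = { ... }" assignment and captures the object body.
    // Assumption: the body contains no "}" character.
    private static readonly Regex ItemBlock =
        new Regex(@"itemCol\[\d+\]\s*=\s*\{(?<body>[^}]*)\}");

    // Matches one key: "value" pair inside a captured body.
    // Assumption: values are double-quoted and contain no escaped quotes.
    private static readonly Regex Pair =
        new Regex(@"(?<key>\w+)\s*:\s*""(?<value>[^""]*)""");

    public static List<Item> Parse(string script)
    {
        var items = new List<Item>();
        foreach (Match block in ItemBlock.Matches(script))
        {
            // Collect the pairs first so the key order in the script does not matter.
            var fields = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            foreach (Match pair in Pair.Matches(block.Groups["body"].Value))
            {
                fields[pair.Groups["key"].Value] = pair.Groups["value"].Value;
            }

            items.Add(new Item
            {
                Pid = Get(fields, "pid"),
                Desc = Get(fields, "desc"),
                Avail = Get(fields, "avail"),
                Price = Get(fields, "price")
            });
        }
        return items;
    }

    // Returns the value for a key, or null when the script did not define it.
    private static string Get(Dictionary<string, string> fields, string key)
    {
        string value;
        return fields.TryGetValue(key, out value) ? value : null;
    }
}
```

Inside the loop you would then call something like `List<Item> items = ItemColParser.Parse(innerXml);`. This is only workable while the script keeps that simple shape; if the values can contain braces, escaped quotes or expressions, running the script in a real JavaScript engine is the safer route.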

Upvotes: 1
