Roel

Reputation: 764

Reading out a table in C# with HtmlAgilityPack

I have been trying for quite a while, but I am stuck. This is my case:

My friend's web application serves a page with fairly simple HTML that generates data for charts. I want to read certain values from a table on that page, because he needs this information stored in a database.

This is part of the HTML table:

...
<tr>
    <td width=30 align=center bgcolor=#006699 class=W><font color=white>1</font></td>
    <td width=50 bgcolor=#FFFFFF align=center>7387</td>

    <td width=30 height=25 align=center bgcolor=#006699 class=W><font color=white>2</font></td>
    <td width=50 bgcolor=#FFFFFF align=center>2881</td>

    <td width=30 height=25 align=center bgcolor=#006699 class=W><font color=white>3</font></td>
    <td width=50 bgcolor=#FFFFFF align=center>8782</td>

    <td width=30 height=25 align=center bgcolor=#006699 class=W><font color=white>4</font></td>
    <td width=50 bgcolor=#FFFFFF align=center>5297</td>

    <td width=30 height=25 align=center bgcolor=#006699 class=W><font color=white>5</font></td>
    <td width=50 bgcolor=#FFFFFF align=center>749</td>
</tr>
<tr>
    <td align=center bgcolor=#006699 class=W><font color=white>6</font></td>
    <td width=50 bgcolor=#FFFFFF align=center>3136</td>

    <td height=25 align=center bgcolor=#006699 class=W><font color=white>7</font></td>
    <td width=50 bgcolor=#FFFFFF align=center>8768</td>

    <td height=25 align=center bgcolor=#006699 class=W><font color=white>8</font></td>
    <td width=50 bgcolor=#FFFFFF align=center>9548</td>

    <td height=25 align=center bgcolor=#006699 class=W><font color=white>9</font></td>
    <td width=50 bgcolor=#FFFFFF align=center>6565</td>

    <td height=25 align=center bgcolor=#006699 class=W><font color=white>10</font></td>
    <td width=50 bgcolor=#FFFFFF align=center>142</td>
</tr>
...

What I want to achieve:

The output would be 1=7387 and 8=9548, i.e. each given number paired with the value in the cell next to it. I got stuck quite quickly while trying to find the two td elements containing the given numbers.

My C# code so far:

using (WebClient webClient = new WebClient())
{
    string completeHTMLCode = webClient.DownloadString("someUrl.php?getChartData=" + chartId);

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(completeHTMLCode);

    foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//td[@...]"))
    {

    }
}

Am I trying something impossible here?

Upvotes: 1

Views: 1021

Answers (3)

Cyril Gupta

Reputation: 13723

Well, if this table data is all you have to work with, it can be parsed using HtmlAgilityPack.

The first thing I'd do is drop the foreach over the tds and use a counter instead, so the counter can serve as an indexer to reach the next cell. The code could look like this:

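// selectednodes: the HtmlNodeCollection returned by SelectNodes("//td").
// myNodecollection: a list that collects the matching value cells.
for (int i = 0; i < selectednodes.Count - 1; i++)
{
    // Label cells wrap their number in a <font> element.
    if (selectednodes[i].InnerHtml.Contains("font"))
    {
        string label = selectednodes[i].InnerText.Trim();
        if (label == "1" || label == "8")
        {
            // The value cell is the td right after the label cell.
            myNodecollection.Add(selectednodes[i + 1]);
        }
    }
}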
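
As an aside, the "find the label cell, then take the next cell" step can also be pushed into the XPath query itself. The following is only a minimal sketch, assuming the label always sits in a font element inside its td and the value is the td immediately after it, as in the HTML from the question. TableLookup and FindValue are hypothetical names used for illustration, and the label is assumed to contain no quote characters.

```csharp
using System;
using HtmlAgilityPack;

static class TableLookup
{
    // Hypothetical helper: returns the text of the td that follows the
    // label cell, or null when no cell with that label exists.
    public static string FindValue(HtmlDocument doc, string label)
    {
        // Select the td whose <font> child holds the label, then step to
        // the first td sibling after it (the value cell).
        string xpath = "//td[font[normalize-space(.)='" + label + "']]"
                     + "/following-sibling::td[1]";
        HtmlNode valueCell = doc.DocumentNode.SelectSingleNode(xpath);
        return valueCell == null ? null : valueCell.InnerText.Trim();
    }

    static void Main()
    {
        // Shortened sample in the same shape as the table in the question.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<table><tr>" +
            "<td class=W><font color=white>1</font></td><td>7387</td>" +
            "<td class=W><font color=white>8</font></td><td>9548</td>" +
            "</tr></table>");

        foreach (string label in new[] { "1", "8" })
        {
            Console.WriteLine(label + "=" + FindValue(doc, label));
        }
    }
}
```

With the shortened sample above this should print 1=7387 and 8=9548. Compared with the index-based loop, the sibling relationship lives in the query, so there is no need to guard against running past the end of the collection.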

Upvotes: 0

Roy Ashbrook

Reputation: 854

You can just parse it into a dictionary and look it up that way. I could think of perhaps some better ways to parse it, but this does what you want.

void Main()
{
    string html = @"<tr>
    <td width=30 align=center bgcolor=#006699 class=W><font color=white>1</font></td>
    <td width=50 bgcolor=#FFFFFF align=center>7387</td>

    <td width=30 height=25 align=center bgcolor=#006699 class=W><font color=white>2</font></td>
    <td width=50 bgcolor=#FFFFFF align=center>2881</td>

    <td width=30 height=25 align=center bgcolor=#006699 class=W><font color=white>3</font></td>
    <td width=50 bgcolor=#FFFFFF align=center>8782</td>

    <td width=30 height=25 align=center bgcolor=#006699 class=W><font color=white>4</font></td>
    <td width=50 bgcolor=#FFFFFF align=center>5297</td>

    <td width=30 height=25 align=center bgcolor=#006699 class=W><font color=white>5</font></td>
    <td width=50 bgcolor=#FFFFFF align=center>749</td>
</tr>
<tr>
    <td align=center bgcolor=#006699 class=W><font color=white>6</font></td>
    <td width=50 bgcolor=#FFFFFF align=center>3136</td>

    <td height=25 align=center bgcolor=#006699 class=W><font color=white>7</font></td>
    <td width=50 bgcolor=#FFFFFF align=center>8768</td>

    <td height=25 align=center bgcolor=#006699 class=W><font color=white>8</font></td>
    <td width=50 bgcolor=#FFFFFF align=center>9548</td>

    <td height=25 align=center bgcolor=#006699 class=W><font color=white>9</font></td>
    <td width=50 bgcolor=#FFFFFF align=center>6565</td>

    <td height=25 align=center bgcolor=#006699 class=W><font color=white>10</font></td>
    <td width=50 bgcolor=#FFFFFF align=center>142</td>
</tr>";

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    int[] nodes = doc.DocumentNode.SelectNodes("//td").Select ( dn =>
        int.Parse(dn.InnerHtml.Contains("font") ? dn.FirstChild.InnerHtml : dn.InnerHtml)
        ).ToArray();

    Dictionary<int,int> d = new Dictionary<int,int>();
    for (int i = 0; i < nodes.Length; i+=2)
        d.Add(nodes[i],nodes[i+1]);

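    // Dump() is LINQPad's output extension; outside LINQPad, print the
    // values with Console.WriteLine instead.
    d.Dump();
    d[1].Dump();
    d[8].Dump();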
}

Upvotes: 1

Joel Peltonen

Reputation: 13402

I made a quick CsQuery sample showing how to accomplish this.

string file = File.ReadAllText("a.html"); // gets the html

CQ dom = file; // initializes csquery
CQ td = dom["td"]; // get all td elements

td.Each((i,e) => { // go through each
    if (e.FirstChild != null) // if element has child (font)
    {
        if (e.FirstChild.NodeType != NodeType.TEXT_NODE) // ignore text node
        {
            if (e.FirstChild.InnerText == "1") // if number is 1
            {
                Console.WriteLine(e.NextElementSibling.InnerText); // output the text
            }
            if (e.FirstChild.InnerText == "8") // etc etc
            {
                Console.WriteLine(e.NextElementSibling.InnerText);
            }
        }
    }

});

Console.ReadKey();

Upvotes: 3
