Reputation: 1395

Find Multiple Tables using HTML Agility Pack

I am trying to find the second table, "Team and Opponent Stats", on the following page.

https://www.basketball-reference.com/teams/BOS/2017.html

But my code only shows the first table. I've tried all kinds of XPath combinations, e.g. "//table[@id='DataTables_Table_0']/tr/td", but nothing seems to work.

Here is my code:

var url = "https://www.basketball-reference.com/teams/BOS/2017.html";
var web = new HtmlWeb();
var doc = web.Load(url);

var table1 = doc.DocumentNode
    .Descendants("tr")
    .Select(n => n.Elements("td").Select(p => p.InnerText).ToArray());

foreach (string[] s in table1)
{
    foreach (string str in s)
    {
        Console.WriteLine(str);
    }
    //Console.WriteLine(s);
}

foreach (var cell in doc.DocumentNode.SelectNodes("//table[@id='DataTables_Table_0']/tr/td"))
{
    Console.WriteLine(cell.InnerText);
}

Here is my modified code:

foreach (HtmlNode tr in doc.DocumentNode.SelectNodes("//table[@id=\"team_and_opponent\"]//tbody"))
{
    //looping on each row, get col1 and col2 of each row
    HtmlNodeCollection tds = tr.SelectNodes("td");
    for (int i = 0; i < tds.Count; i++)
    {
        Console.WriteLine(tds[i].InnerText);
    }
}

Here is the HTML for the section of the website that I want to scrape.

  <div class="table_outer_container">
      <div class="overthrow table_container" id="div_team_and_opponent">
  <table class="suppress_all stats_table" id="team_and_opponent" data-cols-to-freeze="1"><caption>Team and Opponent Stats Table</caption>
   <colgroup><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col></colgroup>
   <thead>      
      <tr>
         <th aria-label="&nbsp;" data-stat="player" scope="col" class=" poptip sort_default_asc center">&nbsp;</th>
         <th aria-label="Games" data-stat="g" scope="col" class=" poptip sort_default_asc center" data-tip="Games">G</th>

And here is the latest Agility Pack code I'm using to get the right table.

foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//*[@id=\"team_and_opponent\"]"))
{
    string tempStr = table.InnerText;

    // /html/body/div[1]/div[2]/div[2]/div/div/div[3]/table[2]/tbody[2]
    foreach (HtmlNode nodecol in table.SelectNodes("//tr"))
    {
        foreach (HtmlNode cell in nodecol.SelectNodes("th|td"))
        {
            Console.WriteLine("cell: " + cell.InnerHtml.ToString());
        }
    }
}

I'm still getting a NullReference error message.

Upvotes: 2

Answers (3)

mrbear258

Reputation: 56

It appears that the table is originally loaded as a comment and is then made visible using JavaScript.

Use something like SelectSingleNode with the comment's XPath (//*[@id="all_team_and_opponent"]/comment()), take the resulting node's InnerHtml, and remove the comment delimiters (<!-- and -->) so that the markup can be parsed as a normal table.

I made a very simple version of what you can do and uploaded it as a Gist so you can simply check my solution and integrate it into your program or test it on dotnetfiddle.net.
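The steps described above can be sketched roughly as follows. This is not the contents of the Gist, only a minimal illustration under some assumptions: the class name CommentedTableReader, the method ReadRows, and the SampleHtml string are all made up for this example, and the sample markup is a tiny stand-in for the real page so that the code runs without a network request. It also checks for null at each step, because SelectSingleNode and SelectNodes return null when nothing matches, which is a common source of NullReferenceException with HTML Agility Pack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class CommentedTableReader
{
    // Made-up sample that mimics the page: the table sits inside an HTML comment.
    public const string SampleHtml =
        "<div id='all_team_and_opponent'>" +
        "<!-- <table id='team_and_opponent'><tbody>" +
        "<tr><th>Team</th><td>82</td></tr>" +
        "<tr><th>Opponent</th><td>82</td></tr>" +
        "</tbody></table> -->" +
        "</div>";

    // Returns the cell texts of each row of the table hidden in the wrapper's comment.
    // An empty list is returned if the comment or the table cannot be found.
    public static List<string[]> ReadRows(string html, string wrapperId, string tableId)
    {
        var rows = new List<string[]>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches, so check before use.
        var comment = doc.DocumentNode.SelectSingleNode($"//*[@id='{wrapperId}']/comment()");
        if (comment == null) return rows;

        // Drop the comment delimiters so the content parses as ordinary markup.
        var markup = comment.OuterHtml.Trim();
        if (markup.StartsWith("<!--", StringComparison.Ordinal))
            markup = markup.Substring(4);
        if (markup.EndsWith("-->", StringComparison.Ordinal))
            markup = markup.Substring(0, markup.Length - 3);

        // Parse the uncommented markup as its own document.
        var inner = new HtmlDocument();
        inner.LoadHtml(markup);

        // SelectNodes also returns null (not an empty collection) on no match.
        var trs = inner.DocumentNode.SelectNodes($"//table[@id='{tableId}']//tr");
        if (trs == null) return rows;

        foreach (var tr in trs)
        {
            var cells = tr.SelectNodes("th|td");
            if (cells == null) continue; // row without th/td cells
            rows.Add(cells.Select(c => c.InnerText.Trim()).ToArray());
        }
        return rows;
    }
}
```

To try it against the real page, pass the downloaded markup (for example the OuterHtml of the document returned by HtmlWeb.Load) instead of SampleHtml, keeping "all_team_and_opponent" and "team_and_opponent" as the two ids.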

However, if you need to execute the page's JavaScript, you can use any of the following:

WebBrowser Class

It should be fairly easy to use for extracting text when combined with HTML Agility Pack, although it might be trickier for images or other element types. Overall it provides decent performance.

Javascript.Net

It allows you to execute scripts using Chrome's V8 JavaScript engine. You'll just have to find out which files change the content.

Selenium

You can use Selenium plus a WebDriver for your preferred browser (Chrome, Firefox, PhantomJS). It is somewhat slow but very flexible. This is probably overkill, so I recommend one of the options above.

Upvotes: 0

Jeff Mercado

Reputation: 134621

That is a dynamic web page (it is manipulated by client-side JavaScript), so the content you download from the server and see in HtmlAgilityPack will not match what you ultimately see in a browser. The table actually comes back from the server as a comment. Fortunately, the comment contains the full markup for that table, so all you really need to do is select the comment, strip the comment delimiters from the text, parse the result as HTML, then select as usual.

So if you wanted to load this into a data table for instance, you could do this:

using System.Data;
using System.Linq;
using HtmlAgilityPack;

var url = "https://www.basketball-reference.com/teams/BOS/2017.html";
var web = new HtmlWeb();
var doc = web.Load(url);
// The table is served inside an HTML comment under this wrapper div.
var tableComment = doc.DocumentNode
    .SelectSingleNode("//div[@id='all_team_and_opponent']/comment()");
// Strip the leading "<!--" and trailing "-->" (range syntax needs C# 8+), then parse.
var table = HtmlNode.CreateNode(tableComment.OuterHtml[4..^3])
    .SelectSingleNode("//table[@id='team_and_opponent']");
var dataTable = ToDataTable(table);

DataTable ToDataTable(HtmlNode node)
{
    var dt = new DataTable();
    dt.BeginInit();
    // One string column per header cell, named after its aria-label.
    foreach (var col in node.SelectNodes("thead/tr/th"))
        dt.Columns.Add(col.GetAttributeValue("aria-label", ""), typeof(string));
    dt.EndInit();
    dt.BeginLoadData();
    // Each body row contributes the text of its th and td cells.
    foreach (var row in node.SelectNodes("tbody/tr"))
        dt.Rows.Add(row.SelectNodes("th|td").Select(t => t.InnerText).ToArray());
    dt.EndLoadData();
    return dt;
}

Upvotes: 3

dscore

Reputation: 21

Check the ID of the second table you are looking for. IDs are meant to be unique within the DOM, so if the first table is called "DataTables_Table_0", the other table you're trying to retrieve might have an ID of "DataTables_Table_1", or something similar. Look at the page's source.
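One way to check which table IDs are present in the markup that HTML Agility Pack actually receives is to list them in code rather than in the browser's inspector, since the browser shows the DOM after scripts have run. The following is a minimal sketch under stated assumptions: the class TableIdLister and its methods are hypothetical names invented for this example. It reports the IDs of ordinary tables separately from the IDs of tables that only exist inside HTML comments.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TableIdLister
{
    // IDs of <table> elements that are part of the parsed document itself.
    public static List<string> VisibleTableIds(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var tables = doc.DocumentNode.SelectNodes("//table[@id]");
        return tables == null
            ? new List<string>()
            : tables.Select(t => t.GetAttributeValue("id", "")).ToList();
    }

    // IDs of <table> elements that appear only inside HTML comments.
    public static List<string> CommentedTableIds(string html)
    {
        var ids = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var comments = doc.DocumentNode.SelectNodes("//comment()");
        if (comments == null) return ids;

        foreach (var comment in comments)
        {
            // Remove the comment delimiters, then parse the content as markup.
            var markup = comment.OuterHtml.Trim();
            if (markup.StartsWith("<!--", StringComparison.Ordinal))
                markup = markup.Substring(4);
            if (markup.EndsWith("-->", StringComparison.Ordinal))
                markup = markup.Substring(0, markup.Length - 3);
            ids.AddRange(VisibleTableIds(markup));
        }
        return ids;
    }
}
```

If an ID seen in the browser shows up in neither list, the table is probably created entirely by script and is not in the downloaded markup at all.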