Connor

Reputation: 49

HtmlAgilityPack Not Finding Specific Node That Should Be There

I'm loading a URL and looking for a specific node that should exist in the HTML document, but SelectSingleNode returns null every time. In fact, every node I try to select comes back null. I have used this same code on other web pages, but for some reason it isn't working in this instance. Could HtmlAgilityPack be loading something different from the source I see in my browser?

I'm obviously new to web scraping but have run into this kind of problem multiple times where I have to make an elaborate workaround because I'm unable to select a node that I can see in my browser. Is there something fundamentally wrong with how I'm going about this?

string[] arr = { "abercrombie", "adt" };
for (int i = 0; i < 1; i++)
{
      string url = @"https://www.google.com/search?rlz=1C1CHBF_enCA834CA834&ei=lsfeXKqsCKOzggf9ub3ICg&q=" + arr[i] + "+ticker" + "&oq=abercrombie+ticker&gs_l=psy-ab.3..35i39j0j0i22i30l2.102876.105833..106007...0.0..0.134.1388.9j5......0....1..gws-wiz.......0i71j0i67j0i131j0i131i67j0i20i263j0i10j0i22i10i30.3zqfY4KZsOg";
      HtmlWeb web = new HtmlWeb();
      var htmlDoc = web.Load(url);
      var node = htmlDoc.DocumentNode.SelectSingleNode("//span[@class = 'HfMth']");
      Console.WriteLine(node.InnerHtml);
}


UPDATE

Thanks to RobertBaron for pointing me in the right direction. Here is a great copy-paste solution.

Upvotes: 0

Views: 107

Answers (2)

RobertBaron

Reputation: 2854

The page you are trying to scrape relies on JavaScript to load its full contents. Your browser runs that JavaScript, so you see the complete page. HtmlWeb.Load() does not run any JavaScript, so you only get a partial page.

You can use the WebBrowser control to scrape that page. Like your browser, it runs the page's JavaScript, so the entire page gets loaded. Several Stack Overflow articles show how to do this. Here are some of them.

Upvotes: 1

QHarr

Reputation: 84475

That content is added dynamically and is not present in what your current method and URL return, which is why your XPath finds nothing. You can check what is returned with, for example:

var node = htmlDoc.DocumentNode.SelectSingleNode("//*");

To show that selection itself works, select something that is present in the response for your first URL:

var node = htmlDoc.DocumentNode.SelectSingleNode("//span[@class = 'st']");
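The two checks above can be combined into a small helper. This is only a rough sketch: the class name ScrapeDiagnostics, its method names, and the dumpPath parameter are made up for illustration and are not part of HtmlAgilityPack. It saves the markup that HtmlWeb actually received, so you can compare it with what the browser shows, and it reports a message instead of throwing when the XPath matches nothing.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

// Hypothetical helper class; the names below are illustrative, not library API.
public static class ScrapeDiagnostics
{
    // Returns the matched node's inner HTML, or a readable message
    // when SelectSingleNode returns null because nothing matched.
    public static string InnerHtmlOrMessage(HtmlDocument doc, string xpath)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node != null ? node.InnerHtml : "No node matched: " + xpath;
    }

    // Downloads the page the same way the question does, writes the
    // received markup to dumpPath, and prints the result of the XPath lookup.
    public static void Report(string url, string xpath, string dumpPath)
    {
        HtmlDocument doc = new HtmlWeb().Load(url);
        File.WriteAllText(dumpPath, doc.DocumentNode.OuterHtml);
        Console.WriteLine(InnerHtmlOrMessage(doc, xpath));
    }
}
```

Calling ScrapeDiagnostics.Report(url, "//span[@class = 'HfMth']", "downloaded.html") and then opening downloaded.html should show whether the span is in the response at all. If it is missing there, no XPath will find it.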

You can use the Network tab of your browser's developer tools to see whether the dynamic content you are after is available from a separate XHR request URL.

Upvotes: 0
