Fuzz Evans

Reputation: 2943

How to get URLs on page with HTMLAgilityPack, when the Source does not contain the URLs?

I am trying to scrape the KB Urls from this page: https://support.microsoft.com/en-us/kb/894199

On the page, there are URLs such as: https://support.microsoft.com/kb/2976978

If you open the developer tools in Chrome, the markup for one of these links looks like this:

<div class="indent">
<a id="kb-link-142" href="https://support.microsoft.com/kb/2976978" target="_self">https://support.microsoft.com/kb/2976978</a>
</div>

Based on the above HTML, I believe I should be able to scrape the URLs from the href attribute like this:

var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
if (nodes != null) // SelectNodes returns null when nothing matches
{
   foreach (HtmlNode link in nodes)
   {
      list.Add(link.GetAttributeValue("href", string.Empty));
   }
}

The problem I am running into is that the HTML I download is not the HTML I see in the developer tools. Even though the developer tools show the above markup on the page, if you right-click the page and choose View source, the HTML shown there is totally different and does not contain any of the URLs that the rendered page displays.

My theory is that the HTML references a file somewhere, and that file contains the details of the page that gets rendered. So how can I use HTMLAgilityPack to get the URLs that are on the rendered page, since the source doesn't seem to contain them?

Also, I realize my question title may be really confusing. If there is a technical term for what this page is doing or how it works, let me know and I can update the title so it is clearer and others can find it in the future.

Upvotes: 1

Views: 1073

Answers (1)

mojorisinify

Reputation: 405

Okay, I see the problem now. This page uses AngularJS directives and bindings, and the hrefs are filled in after the initial page load. The HTML we download is the page as the server sends it, before a browser has parsed it or run any scripts. This means that any changes made afterwards by DOM manipulation, JavaScript, or AJAX will not be included in the HtmlDocument. I think the way to go is to make the request the way a web browser would, let the JavaScript and AJAX calls finish executing, and then fetch the content, as advised here. Hope this helps!
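
As a rough sketch of that idea (not necessarily the approach behind the link above), something like the following might work. It assumes the Selenium.WebDriver and Selenium.Support NuGet packages plus a local Chrome install, none of which the question mentions. The class and method names are made up for illustration, and the "kb-link-" id prefix is only a guess based on the single anchor shown in the question. A real browser renders the page, and the rendered markup is then handed to HtmlAgilityPack exactly as before.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

// Hypothetical helper class; the names are illustrative only.
public static class RenderedLinkScraper
{
    // Pulls every href value out of an HTML string with HtmlAgilityPack.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes == null)
        {
            return links; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode link in nodes)
        {
            links.Add(link.GetAttributeValue("href", string.Empty));
        }
        return links;
    }

    // Loads the page in headless Chrome so the scripts run, then parses
    // the rendered DOM instead of the raw server response.
    public static List<string> ScrapeRendered(string url)
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless");

        using (IWebDriver driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl(url);

            // Assumption: the KB anchors all have ids starting with "kb-link-",
            // as in the one example from the question. Wait up to 15 seconds
            // for at least one of them to appear.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(15));
            wait.Until(d => d.FindElements(By.CssSelector("a[id^='kb-link-']")).Count > 0);

            return ExtractLinks(driver.PageSource);
        }
    }
}
```

Calling RenderedLinkScraper.ScrapeRendered("https://support.microsoft.com/en-us/kb/894199") should then return the hrefs from the rendered page, provided the page still behaves the way it did when the question was asked. The explicit wait matters: reading PageSource immediately after navigation can still return the markup from before the bindings have been filled in.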

Upvotes: 1
