Reputation: 276
I just started using HtmlAgilityPack to scrape some text from websites. I have experimented and found that some websites are easier than others when it comes to getting the proper XPath for the SelectNodes method. I believe I am doing something wrong but can't figure it out.
For example, when exploring the DOM in Google Chrome, I am able to copy the XPath: //*[@id="page"]/span/table[7]/tbody/tr[1]/td/span[2]/a
Then I would do something like:
var search = doc.DocumentNode.SelectNodes("//*[@id=\"page\"]//span//table//tr//td//span//a");
When using search in a foreach loop, I get a null reference error, and sure enough the debugger says search is null. So I am assuming the XPath is wrong (or I am doing something else totally wrong). So my question is: how exactly do I get the proper XPath for HtmlAgilityPack to find these nodes?
Upvotes: 0
Views: 1239
Reputation: 11398
Following up on what you asked in your last comment: the HTML is fully rendered only after the initial HTTP GET request has returned.
Several JavaScript calls then insert blocks of HTML into the document.
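As a rough illustration of why SelectNodes comes back null in that situation, here is a minimal sketch. The inline HTML string is a made-up stand-in for the static markup returned by the first request (it is not the real page), and it assumes the HtmlAgilityPack package is referenced. It shows that SelectNodes returns null rather than an empty collection when nothing matches, so a null check before the foreach avoids the exception.

```csharp
using System;
using HtmlAgilityPack;

class NullSelectDemo
{
    static void Main()
    {
        // Hypothetical stand-in for the static HTML: the container exists,
        // but the links that JavaScript would insert later are missing.
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id=\"page\"><span></span></div>");

        // SelectNodes returns null (not an empty list) when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//*[@id='page']//a");

        if (links == null)
        {
            Console.WriteLine("No matching nodes in the static HTML.");
            return;
        }

        foreach (var link in links)
        {
            Console.WriteLine(link.InnerText);
        }
    }
}
```

If the check reports no nodes for an XPath that works in the browser, the content is probably being added after the page loads.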
The one you want is loadCompanyProfileData('ContactInfo'), which generates an HTTP GET request that looks like:
http://financials.morningstar.com/cmpind/company-profile/component.action?component=ContactInfo&t=XNAS:AAPL&region=usa&culture=en-US&cur=&_=1465809033745
This returns the email, which you can extract with code like the following:
using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions; // provides the CssSelect extension method

HtmlWeb w = new HtmlWeb();
var doc = w.Load("http://financials.morningstar.com/cmpind/company-profile/component.action?component=ContactInfo&t=XNAS:AAPL&region=usa&culture=en-US&cur=&_=1465809033745");
// Default to an empty string so anchors without an href do not throw.
var emails = doc.DocumentNode.CssSelect("a")
    .Where(a => a.GetAttributeValue("href", string.Empty).StartsWith("mailto:"))
    .Select(a => a.GetAttributeValue("href", string.Empty).Replace("mailto:", string.Empty));
emails ends up containing one element: [email protected].
Your problem is then to determine what the "cur" parameter used by the loadCompanyProfileData JavaScript function should be for each distinct company.
I could not find where or how this parameter is generated in the code.
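For reference, the request URL for a given company could be assembled along these lines. This is only a sketch: ComponentUrl.Build is a hypothetical helper, the parameter names are copied from the sample request, "cur" is left empty exactly as in that sample (whether other companies need a different value is not established here), and treating "_" as a millisecond timestamp is an assumption based on how the sample value looks.

```csharp
using System;

static class ComponentUrl
{
    // Base address copied from the sample request.
    const string BaseUrl =
        "http://financials.morningstar.com/cmpind/company-profile/component.action";

    // Hypothetical helper: fills in the component and ticker, and keeps
    // region, culture and the empty "cur" as they appear in the sample.
    public static string Build(string component, string ticker, long timestampMs)
    {
        return BaseUrl
            + "?component=" + Uri.EscapeDataString(component)
            + "&t=" + Uri.EscapeDataString(ticker) // ':' is encoded as %3A
            + "&region=usa&culture=en-US&cur="
            + "&_=" + timestampMs; // assumed to be a millisecond timestamp
    }
}

class Program
{
    static void Main()
    {
        long now = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
        Console.WriteLine(ComponentUrl.Build("ContactInfo", "XNAS:AAPL", now));
    }
}
```

The resulting string can be passed to HtmlWeb.Load in place of the hard-coded URL, one ticker at a time.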
One alternative would be to run a browser emulator (like the Selenium WebDriver port for C#) so that you can execute JavaScript code, and run the call to loadCompanyProfileData('ContactInfo') for each company request.
But I could not get this to work either; script execution in my WebDriver setup does not seem to be working.
Upvotes: 1