Reputation: 103
Trying to scrape a .pdf from a site but the XPath is being stubborn.
Site I'm trying to get the .pdf from
XPath given by Inspect > Copy > Copy XPath:
//*[@id="content"]/div/table[2]/tbody/tr[0]/td[3]/a
For some reason /tbody does nothing but cause an issue. Removing it has worked for all the other XPath expressions I'm using, and it seems to be the way to go here as well.
//*[@id="content"]/div/table[2]/tr[0]/td[3]/a
This yields the result:
<img width="16" height="16" src="/apps/cba/g_doctype_pdf.gif" border="0"><br><small>Download<br>Agreement</small>
Which seems to be a child node?
In any case backing the xpath up a bit to:
//*[@id="content"]/div/table[2]/tr[0]/td[3]
gets me
<a target="_blank" href="/apps/cba/docs/1088-CBA6-2017_Redacted.pdf"><img width="16" height="16" src="/apps/cba/g_doctype_pdf.gif" border="0"><br><small>Download<br>Agreement</small></a>
This is nice, since all I need is the value of the href attribute and I can reconstruct the URL from there. I'm not a wizard with XPath, but it seems to me that this final adjustment should get me what I want:
//*[@id="content"]/div/table[2]/tr[0]/td[3]/@href
However, it returns the tag again. I'm stumped on this. Any suggestions?
Edit:
The marked solution made it apparent to me that I was making an assumption. I assumed that I could dereference the href attribute in the same manner that I was dereferencing other nodes. This is not the case, and I had to adjust my dereferencing to something like this:
var node_collection = hdoc.DocumentNode.SelectNodes(@"//*[@id=""content""]/div/table[2]/tr[1]/td[3]/a/@href");
string output = node_collection[0].Attributes["href"].Value;
The problem was not with the XPath at all. The problem was my lack of understanding of the HtmlDocument object that I was dealing with. Pasting the code where I was trying to get at the href attribute would have made this obvious to anyone experienced. Being too self-conscious about copy-pasting my whole block of messy code made it impossible for anyone to help me. Learn from my mistakes, kids: robust sections of code make it easier to accurately identify the problem.
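To make the distinction above concrete, here is a minimal sketch that uses only System.Xml from the standard library rather than HtmlAgilityPack. The class name HrefSketch, the Sample markup, and the shortened XPath are illustrative assumptions, not the real page. It shows two ways of getting at the value: selecting the attribute node through a navigator and reading its Value, or selecting the a element and reading Attributes["href"] from it.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.XPath;

public static class HrefSketch
{
    // Hypothetical stand-in for the table row from the question, written as well-formed XML.
    public const string Sample =
        "<table><tr><td>a</td><td>b</td>" +
        "<td><a target=\"_blank\" href=\"/apps/cba/docs/1088-CBA6-2017_Redacted.pdf\">Download</a></td>" +
        "</tr></table>";

    // Selects the attribute nodes themselves; each result's Value is the href text.
    public static List<string> HrefsViaAttributeAxis(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        XPathNavigator nav = doc.CreateNavigator();
        var result = new List<string>();
        XPathNodeIterator it = nav.Select("//table/tr[1]/td[3]/a/@href");
        while (it.MoveNext())
        {
            result.Add(it.Current.Value);
        }
        return result;
    }

    // Selects the <a> elements and reads the href attribute from each one.
    public static List<string> HrefsViaElements(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        var result = new List<string>();
        foreach (XmlNode a in doc.SelectNodes("//table/tr[1]/td[3]/a"))
        {
            result.Add(a.Attributes["href"].Value);
        }
        return result;
    }

    public static void Main()
    {
        // Both lines print /apps/cba/docs/1088-CBA6-2017_Redacted.pdf
        Console.WriteLine(HrefsViaAttributeAxis(Sample)[0]);
        Console.WriteLine(HrefsViaElements(Sample)[0]);
    }
}
```

Note that the XPath here uses tr[1]: XPath positions start at 1, so a [0] predicate never matches anything.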
Upvotes: 1
Views: 84
Reputation: 18980
You are right: tbody is added by Chrome's Copy XPath and should be removed, since it is not present in the raw HTML code.*
Selecting the href attribute should work as suggested: //*[@id="content"]/div/table[2]/tr[1]/td[3]/a/@href
I could load the first href like this:
HtmlWeb web = new HtmlWeb();
HtmlDocument hdoc = web.Load("https://work.alberta.ca/apps/cba/searchresults.asp?query=&employer=&union=&locality=&local=&effective_fy=&effective_fm=&effective_ty=&effective_tm=&expiry_fy=&expiry_fm=&expiry_ty=&expiry_tm=");
var nav = (HtmlNodeNavigator)hdoc.CreateNavigator();
var val = nav.SelectSingleNode(@"//*[@id=""content""]/div/table[2]/tr[1]/td[3]/a/@href").Value;
Or all of them like this:
XPathNavigator nav2 = hdoc.CreateNavigator();
XPathNodeIterator xiter = nav2.Select(@"//*[@id=""content""]/div/table[2]/tr/td[3]/a/@href");
while (xiter.MoveNext())
{
    Console.WriteLine(xiter.Current.Value);
}
* However, some engines do require tbody to be present in the XPath, as demonstrated here. Only then do we get a result. See this answer for why tbody is added by Chrome, Firebug, and the like in the first place.
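As a rough illustration of the footnote, the sketch below (System.Xml only; the class name TbodySketch and both markup strings are made up for the example) counts matches for the same two paths against a tree without tbody and a tree with it. The path has to mirror the tree the engine actually built, which is presumably why an expression copied from a browser's normalized DOM can fail against the raw HTML and vice versa.

```csharp
using System;
using System.Xml;

public static class TbodySketch
{
    // Returns how many nodes the XPath matches in the given well-formed markup.
    public static int Count(string xml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        // Hypothetical markup: "raw" as served, "normalized" as a browser DOM would hold it.
        string raw = "<table><tr><td>x</td></tr></table>";
        string normalized = "<table><tbody><tr><td>x</td></tr></tbody></table>";

        Console.WriteLine(Count(raw, "//table/tbody/tr"));        // 0
        Console.WriteLine(Count(raw, "//table/tr"));              // 1
        Console.WriteLine(Count(normalized, "//table/tbody/tr")); // 1
        Console.WriteLine(Count(normalized, "//table/tr"));       // 0
    }
}
```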
Upvotes: 1