craybobnee

Reputation: 103

Trying to select slippery href attribute with XPath in C#

Trying to scrape a .pdf from a site but the XPath is being stubborn.

Site I'm trying to get the .pdf from

XPath given by Inspect > Copy > Copy XPath:

//*[@id="content"]/div/table[2]/tbody/tr[0]/td[3]/a

For some reason /tbody does nothing but cause an issue. Removing it has worked for all the other XPaths I'm using, and seems to be the way to go here as well.

//*[@id="content"]/div/table[2]/tr[0]/td[3]/a

This yields the result:

<img width="16" height="16" src="/apps/cba/g_doctype_pdf.gif" border="0"><br><small>Download<br>Agreement</small>

Which seems to be a child node?

In any case backing the xpath up a bit to:

//*[@id="content"]/div/table[2]/tr[0]/td[3]

gets me

<a target="_blank" href="/apps/cba/docs/1088-CBA6-2017_Redacted.pdf"><img width="16" height="16" src="/apps/cba/g_doctype_pdf.gif" border="0"><br><small>Download<br>Agreement</small></a>

This is nice since all I need is the value in the href attribute and I can reconstruct the URL and so on. I'm not a wizard with XPath but it seems to me that this final adjustment should get me what I want:

//*[@id="content"]/div/table[2]/tr[0]/td[3]/@href

However it returns the tag again. I'm stumped on this. Any suggestions?

Edit:

The marked solution made it apparent to me that I was making an assumption. I assumed that I could dereference the href attribute in the same manner that I was dereferencing other nodes. This is not the case, and I had to adjust my dereferencing to something like this:

var node_collection = hdoc.DocumentNode.SelectNodes(@"//*[@id=""content""]/div/table[2]/tr[1]/td[3]/a/@href");
string output = node_collection[0].Attributes["href"].Value;

The problem was not with the XPath at all. The problem was my lack of understanding of the HtmlDocument object that I was dealing with. Pasting the code where I was trying to get at the href attribute would have made this obvious to anyone experienced. Being too self-conscious about copy-pasting my whole block of messy code made it impossible for anyone to help me. Learn from my mistakes, kids: complete sections of code make it easier to accurately identify the problem.

Upvotes: 1

Views: 84

Answers (1)

wp78de

Reputation: 18980

You are right, tbody is added by Chrome's Copy XPath and should be removed since it is not present in the raw HTML code.*

Selecting the href attribute should work as suggested: //*[@id="content"]/div/table[2]/tr[1]/td[3]/a/@href
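To see why the extra /a step matters, here is a minimal sketch that uses System.Xml.Linq instead of HtmlAgilityPack so it runs without extra packages. The row markup is a simplified, hypothetical stand-in for one row of the results table, and the helper name attributeValues is made up for illustration. It also shows that XPath positions start at 1, so an index of 0 never matches.

```csharp
using System;
using System.Collections;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical, well-formed stand-in for one row of the results table.
var row = XElement.Parse(
    "<tr><td>Employer</td><td>Union</td>" +
    "<td><a target=\"_blank\" href=\"/apps/cba/docs/1088-CBA6-2017_Redacted.pdf\">Download</a></td></tr>");

// Evaluate an XPath relative to the row and collect the values of the attributes it selects.
Func<string, string[]> attributeValues = xpath =>
    ((IEnumerable)row.XPathEvaluate(xpath)).Cast<XAttribute>().Select(a => a.Value).ToArray();

// The td element has no href attribute of its own, so this selects nothing.
var fromCell = attributeValues("td[3]/@href");

// The href lives on the a element inside the cell.
var fromAnchor = attributeValues("td[3]/a/@href");

// XPath positions are 1-based, so [0] never matches anything.
var fromZero = attributeValues("td[0]/a/@href");

Console.WriteLine($"td[3]/@href matched {fromCell.Length} attribute(s)");
Console.WriteLine($"td[3]/a/@href matched {fromAnchor.Length} attribute(s): {string.Join(", ", fromAnchor)}");
Console.WriteLine($"td[0]/a/@href matched {fromZero.Length} attribute(s)");
```

HtmlAgilityPack's own SelectNodes may behave differently from a strict XPath engine here, which is why the navigator-based calls are used for the real page.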

I could load the first href like this:

HtmlWeb web = new HtmlWeb();
HtmlDocument hdoc = web.Load("https://work.alberta.ca/apps/cba/searchresults.asp?query=&employer=&union=&locality=&local=&effective_fy=&effective_fm=&effective_ty=&effective_tm=&expiry_fy=&expiry_fm=&expiry_ty=&expiry_tm=");

var nav = (HtmlNodeNavigator)hdoc.CreateNavigator();
var val = nav.SelectSingleNode(@"//*[@id=""content""]/div/table[2]/tr[1]/td[3]/a/@href").Value;

Or all of them like this:

XPathNavigator nav2 = hdoc.CreateNavigator();
XPathNodeIterator xiter = nav2.Select(@"//*[@id=""content""]/div/table[2]/tr/td[3]/a/@href");
while (xiter.MoveNext())
{
    Console.WriteLine(xiter.Current.Value);
}

* However, some engines do require tbody to be present in the XPath, as demonstrated here. Only then do we get a result. See this answer for why tbody is added by Chrome, Firebug, and the like in the first place.
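The tbody difference can be reproduced with a small sketch, again using System.Xml.Linq rather than HtmlAgilityPack so it stays self-contained. Both fragments are hypothetical: one mimics raw markup without tbody, the other mimics a DOM where the parser has inserted it. The helper name Count is made up for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Raw markup as served: rows sit directly under table.
var raw = XElement.Parse(
    "<div><table><tr><td><a href=\"/a.pdf\">A</a></td></tr></table></div>");

// Markup as a browser-style parser exposes it: rows are wrapped in tbody.
var normalized = XElement.Parse(
    "<div><table><tbody><tr><td><a href=\"/a.pdf\">A</a></td></tr></tbody></table></div>");

const string withTbody = "table/tbody/tr/td/a";
const string withoutTbody = "table/tr/td/a";
const string eitherWay = "table//tr/td/a"; // descendant step tolerates both shapes

// Count the elements an XPath selects relative to the given root.
int Count(XElement root, string xpath) => root.XPathSelectElements(xpath).Count();

foreach (var xpath in new[] { withTbody, withoutTbody, eitherWay })
{
    Console.WriteLine($"{xpath}: raw={Count(raw, xpath)}, normalized={Count(normalized, xpath)}");
}
```

The descendant form is convenient when you do not know which shape the engine produces, but it would also pick up rows of any nested tables, so it is a trade-off rather than a drop-in fix.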

Upvotes: 1
