Reputation: 57
I have a problem that my xpath is not working.
I am trying to get the url from Google.com's search result list into a string list.
But i am unable to reach on url using Xpath.
Please help me in correcting my xpath. Also tell me what should be on the place of ??
HtmlWeb hw = new HtmlWeb();
List<string> urls = new List<string>();
HtmlAgilityPack.HtmlDocument doc = hw.Load("http://www.google.com/search?q=" +txtURL.Text.Replace(" " , "+"));
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//div[@class='f kv']");
foreach (HtmlNode linkNode in linkNodes)
{
HtmlAttribute link = linkNode.Attributes["?????????"];
urls.Add(link.Value);
}
for (int i = 0; i <= urls.Count - 1; i++)
{
if (urls.ElementAt(i) != null)
{
if (IsValid(urls.ElementAt(i)) != true)
{
grid.Rows.Add(urls.ElementAt(i));
}
}
}
Upvotes: 0
Views: 536
Reputation: 40576
The correct XPath is "//div[@class='kv']/cite"
. The f
class you see in the browser element inspector is (probably) added after the page is rendered using javascript.
Also, the link text is not in an attribute, you can get it using the InnerText
property of the <div>
element(s) obtained at the earlier step.
I changed these lines and it works:
var linkNodes = doc.DocumentNode.SelectNodes("//div[@class='kv']/cite");
foreach (HtmlNode linkNode in linkNodes)
{
urls.Add(linkNode.InnerText);
}
There's a caveat though: some links are trimmed (you'll see a ...
in the middle)
Upvotes: 0
Reputation: 499382
The URLs seem to live in the cite
element under that selected div
s, so the XPath to select those is //div[@class='f kv']/cite
.
Now, since these contain markup but you only want the text, select the InnerText
of the selected nodes. Note that these do not begin with http://
.
HtmlNodeCollection linkNodes =
doc.DocumentNode.SelectNodes("//div[@class='f kv']/cite");
foreach (HtmlNode linkNode in linkNodes)
{
HtmlAttribute link = linkNode.InnerText;
urls.Add(link.Value);
}
Upvotes: 1