Suresh Sharma

Reputation: 57

HtmlAgilityPack XPath incorrect

My XPath is not working.

I am trying to get the URLs from Google.com's search result list into a string list.

But I am unable to reach the URLs using XPath.

Please help me correct my XPath. Also, tell me what should go in place of the ????????? placeholder.

HtmlWeb hw = new HtmlWeb();
List<string> urls = new List<string>();
HtmlAgilityPack.HtmlDocument doc = hw.Load("http://www.google.com/search?q=" +txtURL.Text.Replace(" " , "+"));
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//div[@class='f kv']");
foreach (HtmlNode linkNode in linkNodes)
{
    HtmlAttribute link = linkNode.Attributes["?????????"];
    urls.Add(link.Value);

}
for (int i = 0; i <= urls.Count - 1; i++)
{
    if (urls.ElementAt(i) != null)
    {
        if (IsValid(urls.ElementAt(i)) != true)
        {
            grid.Rows.Add(urls.ElementAt(i));

        }
    }
}

Upvotes: 0

Views: 536

Answers (2)

Cristian Lupascu

Reputation: 40576

The correct XPath is "//div[@class='kv']/cite". The f class you see in the browser element inspector is (probably) added by JavaScript after the page is rendered.

Also, the link text is not in an attribute; you can get it using the InnerText property of the <cite> element(s) selected by that XPath.

I changed these lines and it works:

var linkNodes = doc.DocumentNode.SelectNodes("//div[@class='kv']/cite");

foreach (HtmlNode linkNode in linkNodes)
{
    urls.Add(linkNode.InnerText);
}

There's a caveat though: some links are trimmed (you'll see a ... in the middle).
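
If the trimmed entries are not useful, one option is to drop them before adding to the list. This is only a sketch: the LinkFilter class and its method names are hypothetical (not part of HtmlAgilityPack), and it assumes the trimming shows up as a literal "..." or a single ellipsis character in the text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LinkFilter
{
    // Hypothetical helper: treats any text containing "..." or the single
    // ellipsis character (U+2026) as a trimmed link. How the trimming
    // actually appears in the page text is an assumption.
    public static bool IsTrimmed(string text)
    {
        return text.Contains("...") || text.Contains("\u2026");
    }

    // Returns only the entries that do not look trimmed.
    public static List<string> KeepComplete(IEnumerable<string> texts)
    {
        return texts.Where(t => !IsTrimmed(t)).ToList();
    }
}
```

It could be applied to the collected InnerText values, for example urls = LinkFilter.KeepComplete(urls);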

Upvotes: 0

Oded

Reputation: 499382

The URLs seem to live in the cite element under the selected divs, so the XPath to select those is //div[@class='f kv']/cite.

Now, since these contain markup but you only want the text, select the InnerText of the selected nodes. Note that these do not begin with http://.

HtmlNodeCollection linkNodes =
                       doc.DocumentNode.SelectNodes("//div[@class='f kv']/cite");
foreach (HtmlNode linkNode in linkNodes)
{
    // InnerText is a string (the element's text with markup stripped), not an HtmlAttribute
    string link = linkNode.InnerText;
    urls.Add(link);
}
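
Since the extracted text does not begin with http://, a scheme may need to be prepended before the value can be used as a URL. A minimal sketch, assuming plain http:// is acceptable as the default; the UrlText class and EnsureScheme method are hypothetical names, not part of HtmlAgilityPack.

```csharp
using System;

public static class UrlText
{
    // Hypothetical helper: prepends "http://" when the text has no
    // http/https scheme. Defaulting to "http://" is an assumption.
    public static string EnsureScheme(string text)
    {
        string trimmed = text.Trim();
        if (trimmed.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
            trimmed.StartsWith("https://", StringComparison.OrdinalIgnoreCase))
        {
            return trimmed;
        }
        return "http://" + trimmed;
    }
}
```

Inside the loop, this could be used as urls.Add(UrlText.EnsureScheme(linkNode.InnerText));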

Upvotes: 1
