Kevin D

Reputation: 177

Retrieving specific URLs with HtmlAgilityPack C#

I'm trying to use HtmlAgilityPack to extract specific links from an HTML page. I tried brute-forcing it in plain C#, but that turned out to be a real pain. The links are all inside <div> tags that share the same class. Here's what I have:

HtmlWeb web = new HtmlWeb();
HtmlDocument html = web.Load(url);

//this should select only the <div> tags with the class acTrigger
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//div[@class='acTrigger']"))
{
    //not sure how to dig further in to get the href values from each of the <a> tags
}

The site's markup looks something like this:

            <li>
                <div class="acTrigger">
                    <a href="/16014988/d/" onclick="return queueRefinementAnalytics('Category','Battery')">
                        Battery                                 <em>&nbsp;(1)</em>
                    </a>
                </div>
            </li>
            <li>
                <div class="acTrigger">
                    <a href="/15568540/d/" onclick="return queueRefinementAnalytics('Category','Brakes')">
                        Brakes                                 <em>&nbsp;(2)</em>
                    </a>
                </div>
            </li>
            <li>
                <div class="acTrigger">
                    <a href="/11436914/d/1979-honda-ct90-cables-lines" onclick="return queueRefinementAnalytics('Category','Cables/Lines')">
                        Cables/Lines                                 <em>&nbsp;(1)</em>
                    </a>
                </div>
            </li>

There are a lot of links on this page, but the hrefs I need are in the <a> tags nested inside the <div class="acTrigger"> tags. It would be simple if the <a> tags had a class of their own, but unfortunately only the <div> tags have classes. I need to grab each of those hrefs and store them so I can later visit each page and retrieve more information from it.

I just need a nudge in the right direction to get over this hump; then I should be able to handle the other pages as well. I have no previous experience with HtmlAgilityPack, and all the examples I find extract every URL from a page, not specific ones. A link to an example or documentation would be enough. Any help is greatly appreciated.

Upvotes: 2

Views: 1127

Answers (1)

Tim

Reputation: 1286

You should be able to change your select to include the <a> tag: //div[@class='acTrigger']/a. That way your HtmlNode is your <a> tag instead of the div.

To read each link's href so you can store it, use GetAttributeValue. The second argument is the default returned when the attribute is missing.

foreach (HtmlNode node in html.DocumentNode.SelectNodes("//div[@class='acTrigger']/a"))
{
    // Get the value of the HREF attribute.
    string hrefValue = node.GetAttributeValue( "href", string.Empty );
    // Then store hrefValue for later.
}
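One thing to watch for: as far as I know, SelectNodes returns null rather than an empty collection when nothing matches, so the foreach above would throw on a page with no acTrigger divs. Here's a rough sketch that guards against that, collects the links in a list, and turns the site-relative hrefs into absolute URLs. The AcTriggerLinks class, the Extract method, and the https://example.com/ base address are made up for illustration; substitute the real site's address.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class AcTriggerLinks
{
    // Hypothetical helper (not part of HtmlAgilityPack): returns the absolute
    // URL of every <a> that is a direct child of <div class="acTrigger">.
    public static List<Uri> Extract(string pageHtml, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(pageHtml);

        var links = new List<Uri>();

        // Guard against a null result when no node matches the XPath.
        HtmlNodeCollection anchors =
            doc.DocumentNode.SelectNodes("//div[@class='acTrigger']/a");
        if (anchors == null)
        {
            return links;
        }

        foreach (HtmlNode anchor in anchors)
        {
            string href = anchor.GetAttributeValue("href", string.Empty);
            if (href.Length == 0)
            {
                continue; // skip anchors without an href
            }

            // The hrefs in the question are site-relative ("/16014988/d/"),
            // so resolve them against the page's base address.
            links.Add(new Uri(baseUri, href));
        }

        return links;
    }
}
```

You could call it as AcTriggerLinks.Extract(pageHtml, new Uri("https://example.com/")) and then loop over the returned list to load each page. If you already have the HtmlDocument from web.Load(url), the same null check and loop apply to html.DocumentNode directly.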

Upvotes: 2
