Leonardbd

Reputation: 39

Selecting the href from an <a> node using HtmlAgilityPack

I'm trying to learn web scraping and want to get the href value from the "a" node using HtmlAgilityPack in C#. There are multiple GridCells within the grid view, each containing an article with a smallerCell, and I want the "a" node href value from all of them.

<div class="Tabpanel">
    <div class="GridW">
        <div class="GridCell">
            <article>
                <div class="smallerCell">
                    <a href="..........">
                </div>
            </article>
        </div>
    </div>
    <div class="random">
    </div>
    <div class="random">
    </div>
</div>

This is what I have come up with so far. It feels like I'm making it way more complicated than it has to be. Where do I go from here? Or is there an easier way to do this?

httpclient = new HttpClient();
var html = await httpclient.GetStringAsync(Url);

var htmldoc = new HtmlDocument();
htmldoc.LoadHtml(html);

var ReceptLista = new List<HtmlNode>();
ReceptLista = htmldoc.DocumentNode.Descendants("div")
    .Where(node => node.GetAttributeValue("class", "")
        .Equals("GridW")).ToList();

var finalList = new List<HtmlNode>();
finalList = ReceptLista[0].Descendants("article").ToList();

var finalList2 = new List<List<HtmlNode>>();
for (int i = 0; i < finalList.Count; i++) {
    finalList2.Add(finalList[i].DescendantNodes()
        .Where(node => node.GetAttributeValue("class", "").Equals("RecipeTeaser-content"))
        .ToList());
}

var finalList3 = new List<List<HtmlNode>>();

for (int i = 0; i < finalList2.Count; i++) {
    finalList3.Add(finalList2[i]
        .Where(node => node.GetAttributeValue("class", "").Equals("RecipeTeaser-link js-searchRecipeLink"))
        .ToList());
}

Upvotes: 1

Views: 2279

Answers (2)

Jawad

Reputation: 11364

The simplest way I'd go about it would be this:

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(text);
    var nodesWithARef = doc.DocumentNode.Descendants("a");

    foreach (HtmlNode node in nodesWithARef)
    {
        Console.WriteLine(node.GetAttributeValue("href", ""));
    }

Reasoning: the Descendants function gives you a sequence (an IEnumerable<HtmlNode>, not an array) of every "a" node in the entire HTML document. You can iterate over the nodes and do whatever you need with them; I am simply printing the href.

Another way to go about it would be to look up all the nodes that have the class 'smallerCell'. Then, for each of those nodes, look up any "a" nodes underneath it and print the href of each (or do something else with it).

    var nodesWithSmallerCells = doc.DocumentNode.SelectNodes("//div[@class='smallerCell']");
    if (nodesWithSmallerCells != null)
        foreach (HtmlNode node in nodesWithSmallerCells)
        {
            HtmlNodeCollection children = node.SelectNodes(".//a");
            if (children != null)
                foreach (HtmlNode child in children)
                    Console.WriteLine(child.GetAttributeValue("href", ""));
        }

Upvotes: 2

Madushan

Reputation: 7458

You can probably make things a lot simpler by using XPath.

If you want all the links in article tags, you can do the following.

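// "//article//a" matches anchors at any depth under an article; SelectNodes returns null if nothing matches.
var anchors = htmldoc.DocumentNode.SelectNodes("//article//a");
var links = anchors.Select(a => a.Attributes["href"].Value).ToList();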

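The property is Value. If an anchor might not have an href at all, GetAttributeValue("href", "") is the safer call, since indexing a missing attribute gives you null.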

If you want only the anchor tags that sit inside a div with class smallerCell, which is itself a child of article, you can change the XPath to //article/div[@class='smallerCell']/a.
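
As a rough, self-contained sketch of that narrower query: the markup below is made up to mirror the structure in the question (the /recipe/1, /recipe/2 and /ignored paths are placeholders, not from the real page), and it assumes the HtmlAgilityPack NuGet package and C# top-level statements.

```csharp
// Requires the HtmlAgilityPack NuGet package; written as C# top-level statements.
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical sample markup mirroring the structure in the question.
const string sampleHtml = @"
<div class=""Tabpanel"">
  <div class=""GridW"">
    <div class=""GridCell"">
      <article><div class=""smallerCell""><a href=""/recipe/1"">One</a></div></article>
    </div>
    <div class=""GridCell"">
      <article><div class=""smallerCell""><a href=""/recipe/2"">Two</a></div></article>
    </div>
  </div>
  <div class=""random""><a href=""/ignored"">Elsewhere</a></div>
</div>";

var doc = new HtmlDocument();
doc.LoadHtml(sampleHtml);

// SelectNodes returns null (not an empty collection) when nothing matches,
// so guard before running LINQ over the result.
var anchors = doc.DocumentNode.SelectNodes(
    "//article/div[@class='smallerCell']/a[@href]");

List<string> links = anchors == null
    ? new List<string>()
    : anchors.Select(a => a.GetAttributeValue("href", "")).ToList();

foreach (string link in links)
    Console.WriteLine(link);
```

The anchor under the "random" div is not inside an article, so the query leaves it out. Note that @class='smallerCell' is an exact string comparison; if the real page puts several class names on the element, something like contains(@class, 'smallerCell') may be needed instead.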

You get the idea. I think you're just missing XPath knowledge. Also note that HtmlAgilityPack has plugins that add CSS selectors, so that's also an option if you don't want to use XPath.
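
For the CSS selector route, here is a minimal sketch. It assumes the Fizzler.Systems.HtmlAgilityPack NuGet package as the plugin (the answer does not name a specific one, so that choice is mine), and the sample markup and its /recipe/... paths are again placeholders.

```csharp
// Requires the HtmlAgilityPack and Fizzler.Systems.HtmlAgilityPack NuGet packages
// (Fizzler is an assumed choice of CSS selector plugin); C# top-level statements.
using System;
using System.Collections.Generic;
using System.Linq;
using Fizzler.Systems.HtmlAgilityPack;
using HtmlAgilityPack;

// Hypothetical sample markup mirroring the structure in the question.
const string sampleHtml = @"
<div class=""GridW"">
  <div class=""GridCell"">
    <article><div class=""smallerCell""><a href=""/recipe/1"">One</a></div></article>
  </div>
  <div class=""GridCell"">
    <article><div class=""smallerCell""><a href=""/recipe/2"">Two</a></div></article>
  </div>
</div>
<div class=""random""><a href=""/ignored"">Elsewhere</a></div>";

var doc = new HtmlDocument();
doc.LoadHtml(sampleHtml);

// QuerySelectorAll is an extension method that Fizzler adds to HtmlNode.
List<string> selectorLinks = doc.DocumentNode
    .QuerySelectorAll("article div.smallerCell > a[href]")
    .Select(a => a.GetAttributeValue("href", ""))
    .ToList();

foreach (string link in selectorLinks)
    Console.WriteLine(link);
```

One practical difference from the XPath version: a CSS class selector such as .smallerCell matches by class token, so it should still match elements that carry additional class names.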

Upvotes: 3
