Foertsch

Reputation: 115

Extract links from href tag via HtmlAgilityPack (nodes collection)

I ran into a problem while trying to extract nodes via XPath. I'm trying to extract links from the href attribute of an <a> element. The HTML looks like this:

<span class="purchase-attachment"><a class="purchase-attachment__downloadLink fileLink" href="https://example.com" target="_blank" data-host="example.com" title="title"><span class="purchase-attachment__icon purchase-attachment__docIcon"><svg viewBox="0 0 16 16" class="_389HCiTc17xVEAZm1afRWB" fill="currentColor" focusable="false"><path fill-rule="evenodd" d="M10.0029297,10.9990234 L4.99902344,10.9990234 L4.99902344,10.0009766 L10.0029297,10.0009766 L10.0029297,10.9990234 Z M10.0029297,9.00292969 L4.99902344,9.00292969 L4.99902344,7.99804688 L10.0029297,7.99804688 L10.0029297,9.00292969 Z M8.99804688,7 L4.99902344,7 L4.99902344,6.00195312 L8.99804688,6.00195312 L8.99804688,7 Z M4.00097656,3.99902344 L4.00097656,13.0019531 L11.0009766,13.0019531 L11.0009766,7 L8,3.99902344 L4.00097656,3.99902344 Z M4.00097656,14 C3.70019381,14 3.45865977,13.9088551 3.27636719,13.7265625 C3.09407461,13.5442699 3.00292969,13.3027359 3.00292969,13.0019531 L3.00292969,3.99902344 C3.00292969,3.69824068 3.09407461,3.45670664 3.27636719,3.27441406 C3.45865977,3.09212148 3.70019381,3.00097656 4.00097656,3.00097656 L8.41699219,3.00097656 L11.9990234,6.58300781 L11.9990234,13.0019531 C11.9990234,13.2845066 11.898764,13.5214834 11.6982422,13.7128906 C11.4977204,13.9042978 11.2653008,14 11.0009766,14 L4.00097656,14 Z"></path></svg></span><span class="purchase-attachment__icon purchase-attachment__externalIcon"><svg viewBox="0 0 16 16" class="_389HCiTc17xVEAZm1afRWB" fill="currentColor" focusable="false"><path d="M14 4H11L12.1464 5.14644L7.64645 9.64642L8.35356 10.3535L12.8535 5.85355L14 7V4Z"></path><path d="M4 6H9.87866L7.87866 8H4V12H11V9.1213L12 8.1213V12C12 12.5523 11.5523 13 11 13H4C3.44772 13 3 12.5523 3 12V7C3 6.44772 3.44772 6 4 6Z"></path></svg></span><span class="purchase-attachment__fullName"><span class="purchase-attachment__fileName">name</span><span class="purchase-attachment__extension">.doc</span></span></a></span>

My code looks like this:

doc.DocumentNode.SelectNodes("//span[@class='purchase-attachment']//a[@class='purchase-attachment__downloadLink fileLink']")

The result is empty: I get nothing. I'm new to this and still having a hard time with XPath. Ultimately, I want to get the URL stored in the href attribute (https://example.com). These links are on the <a> element directly inside <span class="purchase-attachment">.

How do I correctly write an expression to extract the value of the href attribute?

Upvotes: 0

Views: 369

Answers (1)

aepot

Reputation: 4824

The XPath is wrong here. Let's fix it by selecting the <a> element directly and reading its href attribute:

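HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@class='purchase-attachment__downloadLink fileLink']");
if (nodes != null) // SelectNodes returns null, not an empty collection, when nothing matches
{
    foreach (HtmlNode node in nodes)
    {
        // The URL lives in the href attribute; InnerText is the visible link text
        Console.WriteLine(node.GetAttributeValue("href", ""));
        Console.WriteLine(node.InnerText);
    }
}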

If you prefer jQuery or JS querySelector syntax, you can install an extension for HtmlAgilityPack: Fizzler.Systems.HtmlAgilityPack

Then the query will look more friendly for a web developer:

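// Requires "using Fizzler.Systems.HtmlAgilityPack;". QuerySelectorAll returns IEnumerable<HtmlNode>, not HtmlNodeCollection
IEnumerable<HtmlNode> nodes = doc.DocumentNode.QuerySelectorAll("a.purchase-attachment__downloadLink.fileLink");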
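
To check the XPath in isolation, here is a minimal self-contained sketch. It assumes the HtmlAgilityPack NuGet package is installed and uses a shortened copy of the markup from the question (the SVG icons are dropped for brevity), loaded from a string rather than from the real page. The Program class and the html variable are illustrative only.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Shortened copy of the markup from the question (SVG icons omitted).
        string html =
            "<span class=\"purchase-attachment\">" +
            "<a class=\"purchase-attachment__downloadLink fileLink\" href=\"https://example.com\">" +
            "<span class=\"purchase-attachment__fileName\">name</span>" +
            "<span class=\"purchase-attachment__extension\">.doc</span>" +
            "</a></span>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(
            "//a[@class='purchase-attachment__downloadLink fileLink']");

        // SelectNodes returns null when nothing matches, so guard before iterating.
        if (nodes == null)
        {
            Console.WriteLine("No matching links found.");
            return;
        }

        foreach (HtmlNode node in nodes)
        {
            // href holds the URL; InnerText is the concatenated text of the link's children.
            Console.WriteLine(node.GetAttributeValue("href", string.Empty));
            Console.WriteLine(node.InnerText);
        }
    }
}
```

With this sample input it should print https://example.com followed by name.doc. If the same query finds the link here but returns nothing on the live page, one possible cause is that the page builds this markup with JavaScript after loading; HtmlAgilityPack only parses the HTML it is given and does not run scripts.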

Upvotes: 1
