Cannot extract <link> element using HtmlAgilityPack and XPath

Question

I am using the Html Agility Pack to select textual data from within RSS XML. For every other node type (title, pubDate, guid, etc.) I can select the inner text using XPath conventions; however, when querying "//link" or indeed "item/link", empty strings are returned.

public static IEnumerable<string> ExtractAllLinks(string rssSource)
{
    // Create a new document.
    var document = new HtmlDocument();
    // Populate the document with an RSS file.
    document.LoadHtml(rssSource);
    // Select out all of the required nodes.
    var itemNodes = document.DocumentNode.SelectNodes("item/link");
    // If zero nodes were found, return an empty list, otherwise return the content of those nodes.
    return itemNodes == null ? new List<string>() : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
}

Does anybody have an understanding of why this element behaves differently to the others?

Additional: Running "item/link" returns zero nodes. Running "//link" returns the correct number of nodes; however, the inner text of each node is zero characters long.

Using the test data below, "//name" returns a single record for "Fred", whereas "//link" returns a single record with an empty string.

<link>Hello World</link><name>Fred</name>

I am certain it's because of the word "link". If I change it to "linkz", it works perfectly.

The workaround below works perfectly. However, I would like to understand why searching on "//link" does not work as it does for other elements.

public static IEnumerable<string> ExtractAllLinks(string rssSource)
{
    // Rename the link tags so the parser does not treat them specially.
    rssSource = rssSource.Replace("<link>", "<link-renamed>");
    rssSource = rssSource.Replace("</link>", "</link-renamed>");
    // Create a new document.
    var document = new HtmlDocument();
    // Populate the document with an RSS file.
    document.LoadHtml(rssSource);
    // Select out all of the required nodes.
    var itemNodes = document.DocumentNode.SelectNodes("//link-renamed");
    // If zero nodes were found, return an empty list, otherwise return the content of those nodes.
    return itemNodes == null ? new List<string>() : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
}

har07 · Accepted Answer

If you print DocumentNode.OuterHtml, you will see the problem:

var html = @"<link>Hello World</link><name>Fred</name>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
Console.WriteLine(doc.DocumentNode.OuterHtml);

Output:

<link>Hello World<name>Fred</name>

link happens to be one of the special tags* that HAP treats as self-closing, so the text following the opening tag is not parsed as the element's content and its InnerText is empty. You can alter this behavior by changing ElementsFlags before parsing the HTML, for example:

var html = @"<link>Hello World</link><name>Fred</name>";
HtmlNode.ElementsFlags.Remove("link");  // remove link from the list of special tags
var doc = new HtmlDocument();
doc.LoadHtml(html);
Console.WriteLine(doc.DocumentNode.OuterHtml);
var links = doc.DocumentNode.SelectNodes("//link");
foreach (HtmlNode link in links)
{
    Console.WriteLine(link.InnerText);
}

Dotnetfiddle Demo

Output:

<link>Hello World</link><name>Fred</name>
Hello World
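
To connect this back to the question, here is a minimal sketch of how the ExtractAllLinks method could apply the same fix. The class name RssLinkExtractor and the sample feed in Main are made up for illustration; the HtmlAgilityPack calls are the same ones used above. It assumes the HtmlAgilityPack package is referenced.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class RssLinkExtractor
{
    // Hypothetical class name; the method mirrors ExtractAllLinks from the question.
    public static IEnumerable<string> ExtractAllLinks(string rssSource)
    {
        // ElementsFlags is a static dictionary, so removing "link" affects every
        // document parsed afterwards in this process. Remove does nothing if the
        // key has already been removed.
        HtmlNode.ElementsFlags.Remove("link");

        var document = new HtmlDocument();
        document.LoadHtml(rssSource);

        // SelectNodes returns null rather than an empty collection when nothing matches.
        var linkNodes = document.DocumentNode.SelectNodes("//link");
        return linkNodes == null
            ? new List<string>()
            : linkNodes.Select(node => node.InnerText).ToList();
    }

    public static void Main()
    {
        // Made-up sample feed, for illustration only.
        var rss = "<rss><channel><item><title>Hello</title>"
                + "<link>http://example.com/a</link></item></channel></rss>";

        foreach (var link in ExtractAllLinks(rss))
        {
            Console.WriteLine(link);
        }
    }
}
```

Because the flag change is global, code in the same application that parses real HTML pages (where link is an empty element in the head) will also see link as a normal element after this call, so it may be worth restricting this to code paths that only handle RSS.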

*) The complete list of special tags, besides link, that are included in the ElementsFlags dictionary by default can be seen in the source code of HtmlNode.cs. Some of the most commonly used HTML tags are among them.
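
As an alternative to reading the source, the default entries can also be inspected at runtime, since ElementsFlags is a public static dictionary. The following is a small sketch under that assumption; the class name ElementsFlagsDump is made up, and the exact set of entries printed may differ between HtmlAgilityPack versions.

```csharp
using System;
using HtmlAgilityPack;

public static class ElementsFlagsDump
{
    public static void Main()
    {
        // Print each tag name registered as special, together with its flags.
        foreach (var entry in HtmlNode.ElementsFlags)
        {
            Console.WriteLine($"{entry.Key}: {entry.Value}");
        }

        // Check a single tag before deciding whether it needs to be removed.
        Console.WriteLine(HtmlNode.ElementsFlags.ContainsKey("link"));
    }
}
```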
