user975343

HtmlAgilityPack getting <Item> tags

I'm trying to use HtmlAgilityPack to parse an HTML page and get the atom:link elements contained in item tags. Here's a sample of the HTML:

<item><atom:link href="http://www.nytimes.com/2013/12/09/world/asia/justice-for-abused-afghan-women-still-elusive-un-report-says.html?partner=rss&amp;emc=rss" rel="standout" />

I've tried to get only the atom:link elements inside item tags by doing the following:

        List<string> urlList = new List<string>();
        HtmlAgilityPack.HtmlWeb nytRssPage = new HtmlAgilityPack.HtmlWeb();
        HtmlAgilityPack.HtmlDocument nytRssDoc = new HtmlAgilityPack.HtmlDocument();
        nytRssDoc = nytRssPage.Load(rssUrl);

        var items = nytRssDoc.DocumentNode.Descendants("item").ToList();// list of <item> tags
        foreach (var item in items)
        {
            var atomLink = item.SelectSingleNode("atom:link");
            string articleUrl = atomLink.InnerText;
            urlList.Add(articleUrl);
        }

The urlList is empty, so I guess I've done something wrong. It would be great if anyone could point me to a solution. Thanks in advance.

Upvotes: 0

Views: 700

Answers (3)

L.B

Reputation: 116118

To parse XML, you don't need HtmlAgilityPack. LINQ to XML can do it:

// Requires: using System.Linq; using System.Xml.Linq;
var url = "http://www.nytimes.com/services/xml/rss/nyt/International.xml";
var xDoc = XDocument.Load(url);

XNamespace atom = "http://www.w3.org/2005/Atom";

var items = xDoc.Descendants("item")
            .Select(item => new
            {
                Title = (string)item.Element("title"),
                Url = item.Element(atom + "link") != null 
                          ? (string)item.Element(atom + "link").Attribute("href") 
                          : (string)item.Element("link")
            })
            .ToList();
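
To try this approach without a network call, here is a minimal sketch that runs the same kind of query against an inline sample. The feed content, the `AtomLinkDemo` class, and the `ExtractUrls` helper are made up for illustration and are not part of the NYT feed or of any library. The sample declares the `atom` prefix on the root element, as a well-formed feed has to, and the code assumes C# 6 or later for the `?.` operator.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class AtomLinkDemo
{
    // Hypothetical two-item feed: the first item has an atom:link, the second only a plain <link>.
    public const string SampleXml = @"<rss version=""2.0"" xmlns:atom=""http://www.w3.org/2005/Atom"">
  <channel>
    <item>
      <title>First</title>
      <link>http://example.com/plain-1</link>
      <atom:link href=""http://example.com/atom-1"" rel=""standout"" />
    </item>
    <item>
      <title>Second</title>
      <link>http://example.com/plain-2</link>
    </item>
  </channel>
</rss>";

    // Returns one URL per <item>: the atom:link href when present, otherwise the plain <link> text.
    public static List<string> ExtractUrls(string xml)
    {
        XNamespace atom = "http://www.w3.org/2005/Atom";
        return XDocument.Parse(xml)
            .Descendants("item")
            .Select(item => (string)item.Element(atom + "link")?.Attribute("href")
                            ?? (string)item.Element("link"))
            .ToList();
    }

    public static void Main()
    {
        foreach (var url in ExtractUrls(SampleXml))
            Console.WriteLine(url);
    }
}
```

The key point is that `atom + "link"` builds a namespace-qualified name, so the lookup matches on the namespace URI rather than on the literal `atom:` prefix.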

Alternatively, you can use the SyndicationFeed class:

// Requires: using System.ServiceModel.Syndication; using System.Xml.Linq;
var url = "http://www.nytimes.com/services/xml/rss/nyt/International.xml";
var xDoc = XDocument.Load(url);
SyndicationFeed feed = SyndicationFeed.Load(xDoc.CreateReader());

Now you can loop over feed.Items.
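
As a rough sketch of what looping over the items could look like, the example below loads a small inline RSS sample and collects each item's title and link URIs. The sample feed, the `FeedLoopDemo` class, and the `ListItems` helper are hypothetical. Which links end up in `item.Links` for a given real feed (for example, whether an `atom:link` inside an RSS item is included) is something to check against your own data rather than assume.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.ServiceModel.Syndication;
using System.Xml;

public static class FeedLoopDemo
{
    // Hypothetical single-item RSS 2.0 feed used only for illustration.
    public const string SampleXml = @"<rss version=""2.0"">
  <channel>
    <title>Sample</title>
    <link>http://example.com/</link>
    <description>Sample feed</description>
    <item>
      <title>First</title>
      <link>http://example.com/first</link>
    </item>
  </channel>
</rss>";

    // Returns one "title -> uri" line for every link of every item.
    public static List<string> ListItems(string xml)
    {
        var lines = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            SyndicationFeed feed = SyndicationFeed.Load(reader);
            foreach (SyndicationItem item in feed.Items)
            {
                foreach (SyndicationLink link in item.Links)
                    lines.Add(item.Title.Text + " -> " + link.Uri);
            }
        }
        return lines;
    }

    public static void Main()
    {
        foreach (var line in ListItems(SampleXml))
            Console.WriteLine(line);
    }
}
```

On .NET Framework this needs a reference to System.ServiceModel; on newer .NET versions the types come from the System.ServiceModel.Syndication package.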

Upvotes: 1

user2718944

Reputation: 371

var doc = new HtmlDocument();
doc.LoadHtml(
    "<item><atom:link href=\"http://www.nytimes.com/2013/12/09/world/asia/justice-for-abused-afghan-women-still-elusive-un-report-says.html?partner=rss&amp;emc=rss\" rel=\"standout\" />");

var urls = doc.DocumentNode
    .SelectNodes("//item/*[name()='atom:link']")
    .SelectMany(node => node.Attributes.AttributesWithName("href").Select(attr => attr.Value))
    .ToList();

Upvotes: 1

jessehouwing

Reputation: 114651

The following code extracts all links:

 var links = doc.DocumentNode.SelectNodes(@"//item/*[name()='atom:link']/@href");

If you want to grab them from each item node, you'll need to use:

 var link = item.SelectSingleNode(@"./*[name()='atom:link']/@href");

I would still suggest loading the Atom feed into a proper XML structure (using LINQ to XML or an XPathNavigable), or using a dedicated Atom library such as Atom.NET, the Windows Feeds API, or the Google Feed API.
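
For the XPath route over a real XML structure, a minimal sketch might look like the following. It uses `XPathDocument` and an `XmlNamespaceManager` so the `atom` prefix in the XPath expression is resolved properly instead of being matched by `name()`. The inline feed, the `XPathFeedDemo` class, and the `ExtractHrefs` helper are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.XPath;

public static class XPathFeedDemo
{
    // Hypothetical feed: only the first item carries an atom:link.
    public const string SampleXml = @"<rss version=""2.0"" xmlns:atom=""http://www.w3.org/2005/Atom"">
  <channel>
    <item>
      <title>First</title>
      <atom:link href=""http://example.com/atom-1"" rel=""standout"" />
    </item>
    <item>
      <title>Second</title>
      <link>http://example.com/plain-2</link>
    </item>
  </channel>
</rss>";

    // Returns the href of every atom:link that is a direct child of an <item>.
    public static List<string> ExtractHrefs(string xml)
    {
        var doc = new XPathDocument(new StringReader(xml));
        XPathNavigator nav = doc.CreateNavigator();

        // Map the prefix used in the XPath expression to the Atom namespace URI.
        var ns = new XmlNamespaceManager(nav.NameTable);
        ns.AddNamespace("atom", "http://www.w3.org/2005/Atom");

        var hrefs = new List<string>();
        XPathNodeIterator it = nav.Select("//item/atom:link/@href", ns);
        while (it.MoveNext())
            hrefs.Add(it.Current.Value);
        return hrefs;
    }

    public static void Main()
    {
        foreach (var href in ExtractHrefs(SampleXml))
            Console.WriteLine(href);
    }
}
```

Unlike the HtmlAgilityPack queries, this one selects the attribute nodes themselves, so each result's `Value` is the URL.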

Upvotes: 2
