Retrieve attributes and span using HTMLAgilityPack library

Question

In this piece of HTML code:



    
        
        
    

    
        Wolf Eyes
        
        Lower Demos
        
        
    

    
        Year
        2013
    

    
        Genre
        Rock
        Pop

I know how to parse it in other ways, but I would like to retrieve the following information using the HtmlAgilityPack library:

Title : Wolf Eyes - Lower Demos
Cover : http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg
Year  : 2013
Genres: Rock, Pop
URL   : http://www.mp3crank.com/wolf-eyes/lower-demos-121866

These values come from the following parts of the HTML:

Title : title="Wolf Eyes - Lower Demos"
Cover : src="http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg"
Year  : 2013
Genre1: Rock
Genre2: Pop
URL   : href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866"

This is what I'm trying, but I always get an "object reference not set" exception when I try to select a single node. Sorry, I'm very new to HTML. I've tried to follow the steps from this question: "HtmlAgilityPack basic how to get title and link?"

Public Class Form1

    Private htmldoc As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument
    Private htmlnodes As HtmlAgilityPack.HtmlNodeCollection = Nothing

    Private Title As String = String.Empty
    Private Cover As String = String.Empty
    Private Genres As String() = {String.Empty}
    Private Year As Integer = -0
    Private URL as String = String.Empty

    Private Sub Test() Handles MyBase.Shown

        ' Load the html document.
        htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))

        ' Select the (10 items) nodes.
        htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")

        ' Loop through the nodes.
        For Each node As HtmlAgilityPack.HtmlNode In htmlnodes

            Title = node.SelectSingleNode("//div[@class='release']").Attributes("title").Value
            Cover = node.SelectSingleNode("//div[@class='thumb']").Attributes("src").Value
            Year = CInt(node.SelectSingleNode("//div[@class='release-year']").Attributes("span").Value)
            Genres = ¿select multiple nodes?
            URL = node.SelectSingleNode("//div[@class='release']").Attributes("href").Value

        Next

    End Sub

End Class

Quango · Accepted Answer

Your mistake here is trying to read an attribute of a child node from the parent node you've found.

When you call node.SelectSingleNode("//div[@class='release']") you do get a div back, but .Attributes holds only the attributes of the div tag itself, not those of the HTML elements nested inside it. The div has no title attribute, so .Attributes("title") returns Nothing, and reading .Value from it throws the "object reference not set" exception.

It's possible to write XPath queries that select the sub-node directly, e.g. //div[@class='release']/h4/a - see http://www.w3schools.com/xpath/xpath_syntax.asp for more information on XPath. Although the examples there are for XML, most of the principles apply to an HTML document as well.

Another approach is to make further XPath calls on the node you've found. I've amended your code to make it work using this approach:

' Load the html document.
htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))

' Select the (10 items) nodes.
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")

' Loop through the nodes.
For Each node As HtmlAgilityPack.HtmlNode In htmlnodes

    Dim releaseNode = node.SelectSingleNode(".//div[@class='release']")
    ' Assumes the node was found and that it contains an <a> tag.
    Title = releaseNode.SelectSingleNode(".//a").Attributes("title").Value
    URL = releaseNode.SelectSingleNode(".//a").Attributes("href").Value

    Dim thumbNode = node.SelectSingleNode(".//div[@class='thumb']")
    Cover = thumbNode.SelectSingleNode(".//img").Attributes("src").Value

    Dim releaseYearNode = node.SelectSingleNode(".//div[@class='release-year']")
    Year = CInt(releaseYearNode.SelectSingleNode(".//span").InnerText)

    Dim genreNode = node.SelectSingleNode(".//div[@class='genre']")
    Dim genreLinks = genreNode.SelectNodes(".//a")
    Genres = (From n In genreLinks Select n.InnerText).ToArray()

    Console.WriteLine("Title : {0}", Title)
    Console.WriteLine("Cover : {0}", Cover)
    Console.WriteLine("Year  : {0}", Year)
    Console.WriteLine("Genres: {0}", String.Join(", ", Genres))
    Console.WriteLine("URL   : {0}", URL)

Next

Note that this code assumes the document is well formed and that every node, element and attribute exists and holds a valid value. You will probably want to add error checking, e.g. If someNode Is Nothing Then ....

Edit: I've amended the code above slightly so that each .SelectSingleNode call uses the ".//" prefix. This makes it work when there are several "item" nodes; without the prefix, the query returns the first match in the whole document rather than the first match under the current node.
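To make the difference between the two prefixes concrete, here is a minimal sketch. The two-item markup and the module name XPathScopeDemo are made up for the demonstration and are not taken from the question's page; it also assumes Option Infer is on, as in the code above.

```vb
Imports System
Imports HtmlAgilityPack

Module XPathScopeDemo
    Sub Main()
        ' Hypothetical two-item document, invented for this demonstration.
        Dim html As String =
            "<div class='item'><div class='release'><h4><a title='First' href='/first'>First</a></h4></div></div>" &
            "<div class='item'><div class='release'><h4><a title='Second' href='/second'>Second</a></h4></div></div>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        For Each item As HtmlNode In doc.DocumentNode.SelectNodes("//div[@class='item']")
            ' "//" is evaluated from the document root, even when called on a child node.
            Dim absolute = item.SelectSingleNode("//div[@class='release']/h4/a")
            ' ".//" is evaluated from the current item node.
            Dim relative = item.SelectSingleNode(".//div[@class='release']/h4/a")

            Console.WriteLine("absolute: {0} | relative: {1}",
                              absolute.GetAttributeValue("title", ""),
                              relative.GetAttributeValue("title", ""))
        Next
    End Sub
End Module
```

On the second iteration the absolute query should still report the first item's title, while the relative query reports the title of the item being processed.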

If you want a shorter XPath solution, here is the same loop with each lookup collapsed into a single XPath expression:

' Load the html document.
htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))

' Select the (10 items) nodes.
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")

' Loop through the nodes.
For Each node As HtmlAgilityPack.HtmlNode In htmlnodes

    Title = node.SelectSingleNode(".//div[@class='release']/h4/a[@title]").Attributes("title").Value
    URL = node.SelectSingleNode(".//div[@class='release']/h4/a[@href]").Attributes("href").Value

    Cover = node.SelectSingleNode(".//div[@class='thumb']/a/img[@src]").Attributes("src").Value

    Year = CInt(node.SelectSingleNode(".//div[@class='release-year']/span").InnerText)

    Dim genreLinks = node.SelectNodes(".//div[@class='genre']/a")
    Genres = (From n In genreLinks Select n.InnerText).ToArray()

    Console.WriteLine("Title : {0}", Title)
    Console.WriteLine("Cover : {0}", Cover)
    Console.WriteLine("Year  : {0}", Year)
    Console.WriteLine("Genres: {0}", String.Join(", ", Genres))
    Console.WriteLine("URL   : {0}", URL)
    Console.WriteLine()

Next
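Both versions assume that every node and attribute exists. As a rough sketch of the error checking mentioned earlier, the lookups could be wrapped as shown below. The helper AttributeOrNothing, the module name and the sample markup are hypothetical names introduced for illustration; the sketch relies on SelectSingleNode and SelectNodes returning Nothing when there is no match.

```vb
Imports System
Imports System.Linq
Imports HtmlAgilityPack

Module SafeLookupDemo

    ' Hypothetical helper: returns the attribute value of the first node matching
    ' the XPath, or Nothing when either the node or the attribute is missing.
    Function AttributeOrNothing(parent As HtmlNode, xpath As String, attributeName As String) As String
        Dim found As HtmlNode = parent.SelectSingleNode(xpath)
        If found Is Nothing Then Return Nothing

        Dim attr As HtmlAttribute = found.Attributes(attributeName)
        If attr Is Nothing Then Return Nothing

        Return attr.Value
    End Function

    Sub Main()
        ' Hypothetical markup: the second item has no release link and no genres.
        Dim html As String =
            "<div class='item'><div class='release'><h4><a title='A' href='/a'>A</a></h4></div>" &
            "<div class='genre'><a>Rock</a><a>Pop</a></div></div>" &
            "<div class='item'><div class='release'><h4>No link</h4></div></div>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        Dim items As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div[@class='item']")
        ' Stop early if the page contains no item nodes at all.
        If items Is Nothing Then Return

        For Each item As HtmlNode In items
            Dim title As String = AttributeOrNothing(item, ".//div[@class='release']/h4/a", "title")
            If title Is Nothing Then
                Console.WriteLine("Skipping an item without a release link.")
                Continue For
            End If

            ' Fall back to an empty array when the item has no genre links.
            Dim genreLinks As HtmlNodeCollection = item.SelectNodes(".//div[@class='genre']/a")
            Dim genres As String() = If(genreLinks Is Nothing,
                                        New String() {},
                                        genreLinks.Select(Function(n) n.InnerText).ToArray())

            Console.WriteLine("{0}: {1}", title, String.Join(", ", genres))
        Next
    End Sub
End Module
```

Whether to skip an incomplete item, as this sketch does, or to fall back to default values is a design choice that depends on how the results are used. For the year, Integer.TryParse would be a more forgiving alternative to CInt when the span text may not be numeric.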
