Sam Clark-Ash

Reputation: 175

Extract a certain part of HTML with XPath and HtmlAgilityPack

I am having an issue with XPath syntax, as I don't understand how to use it to extract certain HTML elements. I am trying to load a video's information from a channel page: http://www.youtube.com/user/CinemaSins/videos

I know there is a line that holds all the details: views, title, ID, etc.

Here is what I am trying to get from within the HTML. It is line 2836 of the page source:

<div class="yt-lockup clearfix  yt-lockup-video yt-lockup-grid context-data-item" data-context-item-id="ntgNB3Mb08Y" data-context-item-views="243,456 views" data-context-item-time="9:01" data-context-item-type="video" data-context-item-user="CinemaSins" data-context-item-title="Everything Wrong With The Chronicles Of Riddick In 8 Minutes Or Less">

I'm not sure how to do this, but I have HtmlAgilityPack added as a resource and have started attempting to get it. Can someone explain how to get all of those details, and the XPath syntax involved?

What I have attempted:

foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='yt-lockup clearfix  yt-lockup-video yt-lockup-grid context-data-item']//a"))
{
    if (node.ChildNodes[0].InnerHtml != String.Empty)
    {
        title.Add(node.ChildNodes[0].InnerHtml);
    }
}

The above code only gets the title of each video, and the results also include a blank entry.

Upvotes: 1

Views: 941

Answers (2)

Sam Clark-Ash

Reputation: 175

The answer I was given did not help, so after a lot of digging I finally understood how XPath works and managed to do it myself, as shown below:

foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='yt-lockup clearfix  yt-lockup-video yt-lockup-grid context-data-item']"))
{
    String val = node.Attributes["data-context-item-id"].Value;
    videoid.Add(val);
}

I just had to select the <div> by its class and read the attribute I wanted from it. Knowing this made it a lot easier to use.
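The same idea extends to the other attributes shown in the question. The following is only a minimal sketch, not a definitive implementation: the `VideoInfo` and `VideoScraper` names are hypothetical, and selecting on the presence of `data-context-item-id` (instead of the exact class string) is an assumption made to keep the XPath less brittle. It also guards against `SelectNodes` returning `null` when nothing matches.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical container for the values stored on each video <div>.
public class VideoInfo
{
    public string Id { get; set; }
    public string Title { get; set; }
    public string Views { get; set; }
    public string Duration { get; set; }
}

public static class VideoScraper
{
    public static List<VideoInfo> ExtractVideos(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var videos = new List<VideoInfo>();

        // Assumption: every video <div> carries a data-context-item-id attribute.
        var nodes = doc.DocumentNode.SelectNodes("//div[@data-context-item-id]");

        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (nodes == null)
        {
            return videos;
        }

        foreach (HtmlNode node in nodes)
        {
            // GetAttributeValue falls back to the default if the attribute is missing.
            videos.Add(new VideoInfo
            {
                Id = node.GetAttributeValue("data-context-item-id", ""),
                Title = node.GetAttributeValue("data-context-item-title", ""),
                Views = node.GetAttributeValue("data-context-item-views", ""),
                Duration = node.GetAttributeValue("data-context-item-time", "")
            });
        }

        return videos;
    }
}
```

Calling `VideoScraper.ExtractVideos(pageSource)` on the downloaded page source should then give one `VideoInfo` per matching `<div>`, or an empty list if the markup has changed.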

Upvotes: 1

user2758799

Reputation:

Your XPath is selecting the <a> element inside the <div>. If you want the attributes of the <div> too, then you need to either:

a) select both elements and process them separately, or

b) run several XPath queries where you specify the exact attribute you want.

Let's go with (a) for this example.

var nodes = doc.DocumentNode.SelectNodes("//div[@class='yt-lockup clearfix  yt-lockup-video yt-lockup-grid context-data-item']");

and get the attributes and title like so:

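foreach(var node in nodes)
{
  foreach(var attribute in node.Attributes)
  {
    // ... Get the values of the attributes here (attribute.Name, attribute.Value).
  }

  // The leading dot makes the query relative to this <div>;
  // a bare "//a" would search the whole document again.
  var linkNodes = node.SelectNodes(".//a");
  // ... Get the InnerHtml as per your own example.
}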
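One detail worth checking when querying from inside the loop: in HtmlAgilityPack, an XPath that starts with `//` is evaluated from the document root even when `SelectNodes` is called on a child node, while `.//` restricts the search to that node's descendants. The sketch below illustrates the difference on a hypothetical two-item page; `SampleHtml`, the `item` class, and the `CountMatches` helper are made up for this example.

```csharp
using System;
using HtmlAgilityPack;

public static class RelativeXPathDemo
{
    // Hypothetical markup used only for this illustration.
    public const string SampleHtml =
        "<div class='item'><a href='/watch?v=1'>First</a></div>" +
        "<div class='item'><a href='/watch?v=2'>Second</a></div>";

    // Counts the nodes an XPath matches when run from the given node.
    public static int CountMatches(HtmlNode node, string xpath)
    {
        var matches = node.SelectNodes(xpath);
        return matches == null ? 0 : matches.Count;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);

        HtmlNode firstItem = doc.DocumentNode.SelectSingleNode("//div[@class='item']");

        // "//a" ignores the context node and searches the whole document.
        Console.WriteLine("//a  matched " + CountMatches(firstItem, "//a"));

        // ".//a" only searches below the context node.
        Console.WriteLine(".//a matched " + CountMatches(firstItem, ".//a"));
    }
}
```

With this sample, the absolute query should find both links and the relative one only the link inside the first `<div>`, which is why the relative form is usually what you want inside a `foreach` over container nodes.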

I hope this was clear enough. Good luck.

Upvotes: 1
