benst

Reputation: 553

Get all <li> elements from inside a certain <div> with C#

I have a web page consisting of several <div> elements.

I would like to write a program that prints all the <li> elements inside a <div> that come after a certain <h4> header. Could anyone give me some help or sample code?

<div id="content">
    <h4>Header</h4>
    <ul>
        <li><a href...></a> THIS IS WHAT I WANT TO GET</li>
    </ul>
</div>

Upvotes: 1

Views: 8574

Answers (3)

Ichabod Clay

Reputation: 2011

If all you want is the text inside the <li></li> tags that sit under the <div id="content"> tag and come right after an <h4> tag, then this should suffice (it uses the HTML Agility Pack):

using System;
using HtmlAgilityPack;

// Load your document first.
// Load() accepts a Stream, a TextReader, or a string path to a file on your computer.
// If the entire document is already loaded into a string, use LoadHtml() instead.
HtmlDocument mainDoc = new HtmlDocument();
mainDoc.Load(@"c:\foobar.html"); // verbatim string, so "\f" is not read as an escape

// Select all the <li> nodes that are inside the element with the id "content"
// and sit in the element directly after an <h4> tag.
// The leading "." keeps the search relative to that element; a bare "//"
// would search the whole document again.
HtmlNodeCollection processMe = mainDoc.GetElementbyId("content")
                                      ?.SelectNodes(".//h4/following-sibling::*[1]//li");

// GetElementbyId() and SelectNodes() both return null when nothing matches.
if (processMe != null)
{
    // Iterate through each <li> node and print the inner text to the console
    foreach (HtmlNode listElement in processMe)
    {
        Console.WriteLine(listElement.InnerText);
    }
}

Upvotes: 0

Anand

Reputation: 14935

If it's a web page, why would you need to parse the HTML? Wouldn't the technology you are using to build the page give you access to all of its elements? For example, if you are using ASP.NET, you could assign ids to your UL and LI elements (with runat="server") and they would be available in the code-behind.

Could you explain your scenario and what you are trying to do? If you are making a web request and downloading the HTML as a string, then scraping the HTML would make sense.

EDIT: I think this should work:

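HtmlDocument doc = new HtmlDocument();
doc.Load(myHtmlFile);

// SelectNodes() returns null when nothing matches, so guard before iterating.
HtmlNodeCollection divs = doc.DocumentNode.SelectNodes("//div");
if (divs != null)
{
    foreach (HtmlNode div in divs)
    {
        // GetAttributeValue() avoids a NullReferenceException on divs without an id.
        if (div.GetAttributeValue("id", "") != "content") continue;

        foreach (HtmlNode ul in div.Elements("ul"))
        {
            // Walk back over whitespace text nodes to the previous element.
            HtmlNode prev = ul.PreviousSibling;
            while (prev != null && prev.NodeType != HtmlNodeType.Element)
                prev = prev.PreviousSibling;

            if (prev != null && prev.Name == "h4" && prev.InnerText.Trim() == "Header")
            {
                foreach (HtmlNode li in ul.Elements("li"))
                {
                    // li is one <li> of the list that follows the header
                }
            }
        }
    }
}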

Upvotes: 0

Ray Hayes

Reputation: 15015

When it comes to parsing HTML in C#, don't try to write your own parser. The HTML Agility Pack is almost certainly capable of doing what you want!

Which parts are constant?

  1. The 'id' in the DIV?
  2. The h4?

Searching a complete HTML document and reacting to the H4 alone is likely to be a mess, whereas if you know the DIV has the ID of "content" then just look for that!

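using System.Linq;

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(yourHtml);

// SelectNodes() returns null when nothing matches, so check before using LINQ.
var allDivs = doc.DocumentNode.SelectNodes("//div");
if ( allDivs != null )
{
   // The inner lambda needs its own parameter name; reusing 'e' does not compile.
   var divs = allDivs.Where(d => d.Descendants().Any(n => n.Name == "h4"));

   // You now have all of the divs with an 'h4' inside of them.

   // The rest of the element structure, if constant, needs to be examined to get
   // the rest of the content you're after.
}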
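
As a rough sketch of that last step: if the structure from the question really is constant (a DIV with the ID "content", an h4, then a ul), a single XPath query can go straight to the li elements. The class name, the GetItems helper and the sample HTML below are made up for illustration; only the HTML Agility Pack calls (LoadHtml, SelectNodes, InnerText) come from the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ListItemExtractor
{
    // Hypothetical helper: returns the text of every <li> in the first <ul>
    // that follows an <h4> inside <div id="content">.
    public static List<string> GetItems(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes() returns null (not an empty collection) when nothing matches.
        var items = doc.DocumentNode.SelectNodes(
            "//div[@id='content']/h4/following-sibling::ul[1]/li");

        return items == null
            ? new List<string>()
            : items.Select(li => li.InnerText.Trim()).ToList();
    }

    public static void Main()
    {
        // Made-up sample modelled on the markup in the question.
        const string html = @"<div id=""content"">
    <h4>Header</h4>
    <ul>
        <li><a href=""#"">link</a> first item</li>
        <li>second item</li>
    </ul>
</div>";

        foreach (var text in GetItems(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

Anchoring the query on the ID first keeps it from picking up unrelated h4 headers elsewhere in the page. If the header text matters, the h4 step can be narrowed with a predicate such as h4[normalize-space()='Header'].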

Upvotes: 2
