Zacke

Reputation: 33

How to extract only the headings (i.e. h2, h3, h4) from an HTML string, like a numbered TOC?

I want to extract everything inside heading tags (i.e. h1, h2, h3, etc.) from an XHTML string, to then use in a side menu.

The XHTML string will have numbered headings, so an h2 will be something like "1.1 Heading", with h3 elements under it such as "1.1.1 Heading", "1.1.2 Heading", and so on.

<div class="main-body">
    <h2>1.1 Heading</h2>
    <h3>1.1.1 Subheading</h3>
    <p>Lorem ipsum</p>
    <h3>1.1.2 Another Subheading</h3>
    <p>Lorem ipsum</p>

    <h2>2.1 Heading</h2>
    <h3>2.1.1 Subheading</h3>
    <p>Lorem ipsum</p>
    <h4>2.1.1.1 SubSubHeading</h4>
</div>

Above is an example of what the HTML will look like. I want to group each parent (h2) together with its children (h3, h4), and when a new h2 is found in the DOM, start a new group containing its children.

Upvotes: 2

Views: 2851

Answers (1)

er-sho

Reputation: 9771

There are two ways to retrieve all the text inside <h1> to <h6> tags.

Suppose your input HTML is:

string input = @"<div class='main-body'>
                     <h2>1.1 Heading</h2>
                     <h3>1.1.1 Subheading</h3>
                     <p>Lorem ipsum</p>
                     <h3>1.1.2 Another Subheading</h3>
                     <p>Lorem ipsum</p>
                     <h2>2.1 Heading</h2>
                     <h3>2.1.1 Subheading</h3>
                     <p>Lorem ipsum</p>
                     <h4>2.1.1.1 SubSubHeading</h4>
                 </div>";

1) By using Regex:

Use this regex to get the text inside every heading tag from <h1> to <h6>:

<h[1-6][^>]*?>(?<TagText>.*?)</h[1-6]>

Usage:

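// Requires: using System.Linq; using System.Text.RegularExpressions;
string pattern = @"<h[1-6][^>]*?>(?<TagText>.*?)</h[1-6]>";

// Note: '.' does not match newlines by default, so each heading must sit on one line
MatchCollection matches = Regex.Matches(input, pattern);

// One string per heading, in document order
var heading_matches = matches.Cast<Match>().Select(x => x.Groups["TagText"].Value);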

To skip h1, use:

string pattern = @"<h[2-6][^>]*?>(?<TagText>.*?)</h[2-6]>";
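
The patterns above return a flat list, while the question asks for each h2 to be grouped with the headings below it. A minimal sketch of one way to do that with regex alone follows. `HeadingGrouper` and `GroupByH2` are hypothetical names, not part of the original answer. The pattern is a variation that also captures the heading level and uses a backreference so that the closing tag has to match the opening one.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, shown only to illustrate the grouping idea.
public static class HeadingGrouper
{
    // (?<Level>...) captures the digit of the opening tag;
    // \k<Level> requires the same digit in the closing tag (h2 ... /h2).
    private static readonly Regex HeadingRegex = new Regex(
        @"<h(?<Level>[2-6])[^>]*>(?<TagText>.*?)</h\k<Level>>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns one list per h2: the h2 text first, then the h3-h6 texts
    // that follow it in document order.
    public static List<List<string>> GroupByH2(string html)
    {
        var groups = new List<List<string>>();

        foreach (Match m in HeadingRegex.Matches(html))
        {
            int level = int.Parse(m.Groups["Level"].Value);
            string text = m.Groups["TagText"].Value.Trim();

            // An h2 starts a new group. A sub-heading that appears before
            // any h2 also gets its own group so it is not lost.
            if (level == 2 || groups.Count == 0)
            {
                groups.Add(new List<string>());
            }

            groups[groups.Count - 1].Add(text);
        }

        return groups;
    }
}
```

Calling `HeadingGrouper.GroupByH2(input)` on the sample input should give two groups, one starting with "1.1 Heading" and one starting with "2.1 Heading". Any markup nested inside a heading (for example an `<a>` tag) is returned as-is in the text, which is one reason the HtmlAgilityPack approach is usually more robust.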

2) By using HtmlAgilityPack:

Use the HtmlAgilityPack package to retrieve all the text inside <h1> to <h6>.

Install it from the NuGet Package Manager Console:

Install-Package HtmlAgilityPack -Version 1.8.14

Usage:

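// Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(input);

string xpathQuery = "//*[starts-with(name(),'h') and string-length(name()) = 2 and number(substring(name(), 2)) <= 6]";

// SelectNodes returns null when nothing matches, so guard before calling Select
var nodes = htmlDocument.DocumentNode.SelectNodes(xpathQuery);
var texts = nodes == null
    ? new List<string>()
    : nodes.Select(x => x.InnerText).ToList();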

To skip h1, use:

string xpathQuery = "//*[starts-with(name(),'h') and string-length(name()) = 2 and number(substring(name(), 2)) > 1 and number(substring(name(), 2)) <= 6]";
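
To turn the selected headings into a nested structure for a side menu, one option is to keep each node's level alongside its text and build a tree with a stack. The sketch below is an assumption about how that could look, not part of the original answer. `TocNode` and `TocBuilder` are hypothetical types, and the code deliberately takes plain `(level, text)` pairs so that it does not depend on HtmlAgilityPack itself.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical menu node; not an HtmlAgilityPack type.
public class TocNode
{
    public int Level { get; }
    public string Text { get; }
    public List<TocNode> Children { get; } = new List<TocNode>();

    public TocNode(int level, string text)
    {
        Level = level;
        Text = text;
    }
}

public static class TocBuilder
{
    // headings: (level, text) pairs in document order, e.g. (2, "1.1 Heading").
    public static List<TocNode> Build(IEnumerable<(int Level, string Text)> headings)
    {
        var roots = new List<TocNode>();
        var open = new Stack<TocNode>();   // chain of currently open ancestors

        foreach (var (level, text) in headings)
        {
            var node = new TocNode(level, text.Trim());

            // Close every open heading at the same or a deeper level.
            while (open.Count > 0 && open.Peek().Level >= level)
            {
                open.Pop();
            }

            if (open.Count == 0)
            {
                roots.Add(node);                  // top-level entry (an h2 here)
            }
            else
            {
                open.Peek().Children.Add(node);   // child of the nearest shallower heading
            }

            open.Push(node);
        }

        return roots;
    }
}
```

With HtmlAgilityPack, the pairs could be produced from the nodes selected above with something like `nodes.Select(n => (int.Parse(n.Name.Substring(1)), n.InnerText))`, since `n.Name` is the tag name (`h2`, `h3`, ...). For the sample input this should yield two root entries, with the h3 headings as their children and the h4 nested under "2.1.1 Subheading".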

Output (from the debugger): [screenshot not available]

Upvotes: 4
