GreenTriangle

Reputation: 2460

Split HTML string into sections based on specific tag?

I have a string representing a HTML snippet like this:

const bookString = `<h1>Chapter 1: The Beginning</h1>
<p>It was a dark and stormy night...</p>
<p>Tom ran up the stairs...</p>
<p>A shot rang out!</p>

<h1>Chapter 2: A Day at the Zoo</h1>
<p>The door swung open...</p>`;

You get the idea: it's a book in which I only expect to see h1, p, and em/strong/i/b tags. (It comes from the Mammoth library, which takes a Word document and gives me an HTML string.) I want to write some JS that splits it up by chapter, like so:

const chapters = [
  {
    title: "The Beginning",
    content:
      "<p>It was a dark and stormy night...</p>" +
      "<p>Tom ran up the stairs...</p>" +
      "<p>A shot rang out!</p>"
  }
  // ...and so on, one object per chapter
];

Then I can pass that to an ebook-generating library.

Should I use an HTML parsing library like Cheerio to do this? I can't quite figure out the selections, like "for each h1, save a title, then for each p following that h1, push to array..." Or should I use regexes, despite the common advice to never use regexes on HTML?

Upvotes: 1

Views: 890

Answers (2)

Patrick Evans

Reputation: 42736

If you want to use Cheerio, you can use the nextUntil() method to get all following sibling elements up to the one matched by the selector you pass in:

//get all elements until the next h1 is encountered
$('h1').nextUntil('h1')

Using that, you can map() over the h1 collection, collect each chapter's contents, and build your objects:

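const cheerio = require('cheerio');
const $ = cheerio.load(bookString);

const chapters = $('h1').map((index, h1) => {
  // serialize every sibling element between this h1 and the next one
  const content = $(h1).nextUntil('h1').map((i, p) => $.html(p)).get().join('');
  return {
    title: $(h1).html(), // the full heading, e.g. "Chapter 1: The Beginning"
    content: content
  };
}).get();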


Upvotes: 3

vsync

Reputation: 130105

One way is to split the string on the opening <h1> tag, drop the empty leading piece, and then map over the remaining pieces, splitting each one again to separate the (clean) title from the content:

var bookString = `<h1>Chapter 1: The Beginning</h1>
<p>It was a dark and stormy night...</p>
<p>Tom ran up the stairs...</p>
<p>A shot rang out!</p>

<h1>Chapter 2: A Day at the Zoo</h1>
<p>The door swung open...</p>`;


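var chapters = bookString.split('<h1>').filter(n => n.trim()).map(text => {
  // split the heading from the body first, so a ': ' inside a paragraph cannot cut the content short
  var cut = text.replace(/\n/g, '').split('</h1>');
  return {
    title   : cut[0].replace(/^[^:]*: /, ''), // drop the "Chapter N: " prefix
    content : cut[1]
  }
});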

console.log(chapters);
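
Since the question also asks about regexes: for the flat h1/p output described there, a single pattern can do the same job. This is only a sketch under assumptions, not a general HTML parser. The name splitChapters is made up here, and it assumes h1 elements are never nested and that each chapter's body runs up to the next h1 or the end of the string. Unlike the split above, it tolerates attributes on the h1 tag.

```javascript
// Hypothetical helper (not part of Mammoth or Cheerio).
// Assumes a flat structure: <h1>heading</h1> followed by that chapter's body.
function splitChapters(html) {
  // group 1: heading text, group 2: everything up to the next <h1> or end of input
  const pattern = /<h1[^>]*>([\s\S]*?)<\/h1>([\s\S]*?)(?=<h1[\s>]|$)/g;
  return Array.from(html.matchAll(pattern), ([, heading, body]) => ({
    // strip an optional "Chapter N: " style prefix, if there is one
    title: heading.replace(/^[^:]*:\s*/, '').trim(),
    // remove the line breaks between paragraphs
    content: body.replace(/\n/g, '').trim()
  }));
}

const sample = `<h1>Chapter 1: The Beginning</h1>
<p>It was a dark and stormy night...</p>
<p>A shot rang out!</p>

<h1>Chapter 2: A Day at the Zoo</h1>
<p>The door swung open...</p>`;

console.log(splitChapters(sample));
```

For the sample this gives two objects, titled "The Beginning" and "A Day at the Zoo". If the markup ever gets more complicated than h1 plus paragraphs, a real parser such as Cheerio is the safer choice.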

Upvotes: 3
