Reputation: 2460
I have a string representing an HTML snippet like this:
const bookString = `<h1>Chapter 1: The Beginning</h1>
<p>It was a dark and stormy night...</p>
<p>Tom ran up the stairs...</p>
<p>A shot rang out!</p>
<h1>Chapter 2: A Day at the Zoo</h1>
<p>The door swung open...</p>`;
You get the idea: it's a book where I only expect to see h1, p, and em/strong/i/b tags. (This comes from the Mammoth library, which takes a Word document and gives me an HTML string.) I want to write some JS that splits it up by chapter, like so:
const chapters = [
  {
    title: "The Beginning",
    content: `<p>It was a dark and stormy night...</p>
<p>Tom ran up the stairs...</p>
<p>A shot rang out!</p>`
  }
];
Then I can pass that to an ebook-generating library.
Should I use an HTML parsing library like Cheerio to do this? I can't quite figure out the selections, like "for each h1, save a title, then for each p following that h1, push to array..." Or should I use regexes, despite the common advice to never use regexes on HTML?
Upvotes: 1
Views: 890
Reputation: 42736
If you want to use Cheerio, you can use the nextUntil() method to get all following sibling elements up to (but not including) the first one matched by the selector you pass:
//get all elements until the next h1 is encountered
$('h1').nextUntil('h1')
Using that, you can then just map() over the h1 collection, getting each set of contents, and finally create your object:
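const cheerio = require('cheerio');
const $ = cheerio.load(bookString);

const chapters = $('h1').map((index, h1) => {
  // Serialize every sibling element between this h1 and the next one
  const content = $(h1).nextUntil('h1').map((i, el) => $.html(el)).get().join('');
  return {
    title: $(h1).html(), // full heading, e.g. "Chapter 1: The Beginning"
    content: content
  };
}).get();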
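Note that the title here is the full heading text, while the question wanted just "The Beginning". A minimal sketch of trimming it, assuming every prefix looks like "Chapter <number>: "; stripChapterPrefix is a hypothetical helper name, not part of Cheerio:

```javascript
// Hypothetical helper (not part of Cheerio): drops a leading "Chapter <number>: ".
// Assumes that prefix format; headings without it are returned unchanged.
function stripChapterPrefix(heading) {
  return heading.replace(/^\s*Chapter\s+\d+:\s*/i, '');
}

console.log(stripChapterPrefix('Chapter 1: The Beginning')); // The Beginning
console.log(stripChapterPrefix('Epilogue'));                 // Epilogue
```

You could then wrap the title expression in the map() callback with this helper.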
Upvotes: 3
Reputation: 130105
One way would be to use a series of split() calls: first break the string into one piece per chapter, then map over those pieces and split each one again to get the clean title and content.
var bookString = `<h1>Chapter 1: The Beginning</h1>
<p>It was a dark and stormy night...</p>
<p>Tom ran up the stairs...</p>
<p>A shot rang out!</p>
<h1>Chapter 2: A Day at the Zoo</h1>
<p>The door swung open...</p>`;
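var chapters = bookString.split('<h1>').filter(n => n).map(text => {
  // Split at the closing tag: heading text first, chapter body second
  var cut = text.split('</h1>');
  return {
    // Drop everything up to the first ": " (the "Chapter N: " prefix)
    title : cut[0].replace(/^.*?: /, ''),
    // Remove the newlines between paragraphs
    content : cut[1].replace(/\n/g, '')
  }
});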
console.log(chapters);
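On the regex half of the question: for input this constrained, a single pattern with matchAll() is also workable. This is only a sketch, assuming the h1 tags never carry attributes and never nest (true of the sample in the question, not of arbitrary HTML); splitChapters is a hypothetical name:

```javascript
// Sketch only: assumes plain <h1>...</h1> tags with no attributes or nesting.
function splitChapters(html) {
  // Group 1: heading text; group 2: everything up to the next <h1> or the end
  const pattern = /<h1>([\s\S]*?)<\/h1>([\s\S]*?)(?=<h1>|$)/g;
  return Array.from(html.matchAll(pattern), ([, heading, body]) => ({
    // Drop a leading "Chapter <number>: " if present
    title: heading.replace(/^Chapter\s+\d+:\s*/, ''),
    // Keep the paragraphs as-is, minus surrounding whitespace
    content: body.trim()
  }));
}

const demoChapters = splitChapters(
  '<h1>Chapter 1: The Beginning</h1>\n<p>A shot rang out!</p>\n' +
  '<h1>Chapter 2: A Day at the Zoo</h1>\n<p>The door swung open...</p>'
);
console.log(demoChapters.map(c => c.title)); // [ 'The Beginning', 'A Day at the Zoo' ]
```

Anything before the first h1 (a preface, say) is skipped, and a ": " inside a paragraph does not affect the result, because only the heading is inspected for the prefix.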
Upvotes: 3