Reputation: 3834
I have an HTML string such as:
<p>
<strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.
</p>
I want to convert this into a JavaScript array that looks like:
['<p>', '<strong>', '<em>', 'Lorem Ipsum ', '</em>', '</strong>', 'is simply dummy text of the printing ', '<em>', 'and', '</em>', 'typesetting industry.', '</p>']
I.e. it takes the HTML string and breaks it down into an array of tags and HTML content.
I have tried to use DOMParser, as per this question:
const str = `<p><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>`;
const doc = new DOMParser().parseFromString(str, 'text/html');
const arr = [...doc.body.childNodes]
  .map(child => child.outerHTML || child.textContent);
However, this simply returns:
['<p><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>']
I have also searched for regex-based solutions, but haven't found any that break the string down exactly as I require.
Any suggestions?
Thanks
Upvotes: 3
Views: 338
Reputation: 370659
I'd make a recursive function to iterate over a given node and return an array of the text representation of its children:
const str = `<p><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>`;
const doc = new DOMParser().parseFromString(str, 'text/html');
const parseNode = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === Node.TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === Node.ELEMENT_NODE) {
      // tagName is uppercase for HTML elements; lowercase it to match the desired output
      const tag = child.tagName.toLowerCase();
      output.push(`<${tag}>`);
      output.push(...parseNode(child));
      output.push(`</${tag}>`);
    }
  }
  return output;
};
console.log(parseNode(doc.body));
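As an aside on the regex route mentioned in the question: for simple markup, splitting on tags with a capturing group may be enough. The sketch below is only that, a sketch. `splitHtml` is a made-up helper name, and it assumes the string contains no comments, no script or style content, and no `>` inside attribute values.

```javascript
// Hypothetical helper: split an HTML string into tags and text runs.
// Assumes simple markup: no comments, no <script>/<style> content,
// and no ">" inside attribute values.
const splitHtml = html =>
  html
    .split(/(<[^>]+>)/) // the capturing group keeps the tags in the result
    .filter(part => part !== ''); // drop empty strings between adjacent tags

const sample = `<p><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>`;
const parts = splitHtml(sample);
console.log(parts);
```

This keeps tags exactly as written (including attributes) and does not need a browser, but it does no parsing or validation, so the DOM-based function above remains the more robust choice.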
If you need to keep attributes too, you could take the outerHTML of the element and capture everything between the tag name and the first closing bracket:
const str = `<p style="color:green"><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>`;
const doc = new DOMParser().parseFromString(str, 'text/html');
const parseNode = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === Node.TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === Node.ELEMENT_NODE) {
      // tagName is uppercase for HTML elements; lowercase it to match the desired output
      const tag = child.tagName.toLowerCase();
      // Capture everything after the tag name up to the first ">"
      // (assumes no attribute value contains ">")
      const attribs = child.outerHTML.match(/<\s*[^>\s]+([^>]*)/)[1];
      output.push(`<${tag}${attribs}>`);
      output.push(...parseNode(child));
      output.push(`</${tag}>`);
    }
  }
  return output;
};
console.log(parseNode(doc.body));
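As an alternative to matching on outerHTML, the opening tag could be rebuilt from the element's `attributes` collection, which avoids the regex altogether. This is a minimal sketch: `openingTag`, `escapeAttr`, and `fakeP` are illustrative names, not part of any API, and `fakeP` is a plain object standing in for a real element so the snippet can run outside a browser.

```javascript
// Hypothetical helper: escape the characters that would break a
// double-quoted attribute value.
const escapeAttr = value =>
  String(value).replace(/&/g, '&amp;').replace(/"/g, '&quot;');

// Hypothetical helper: rebuild an opening tag from anything shaped like
// a DOM Element (a localName plus an array-like attributes collection).
const openingTag = el => {
  const attribs = Array.from(el.attributes)
    .map(({ name, value }) => ` ${name}="${escapeAttr(value)}"`)
    .join('');
  return `<${el.localName}${attribs}>`;
};

// Stand-in object for illustration; in the page you would pass a real
// element, e.g. doc.body.firstElementChild.
const fakeP = {
  localName: 'p',
  attributes: [{ name: 'style', value: 'color:green' }],
};
const tag = openingTag(fakeP);
console.log(tag); // <p style="color:green">
```

Inside the loop, `openingTag(child)` would then take the place of the outerHTML match.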
If you need self-closing (void) tags such as input not to be expanded, check whether the outerHTML of the element contains a closing tag, i.e. the sequence </ :
const str = `<p style="color:green"><input readonly value="x"/><strong><em>Lorem Ipsum </em></strong>is simply dummy text of the printing <em>and</em> typesetting industry.</p>`;
const doc = new DOMParser().parseFromString(str, 'text/html');
const parseNode = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === Node.TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === Node.ELEMENT_NODE) {
      // tagName is uppercase for HTML elements; lowercase it to match the desired output
      const tag = child.tagName.toLowerCase();
      // Capture everything after the tag name up to the first ">"
      const attribs = child.outerHTML.match(/<\s*[^>\s]+([^>]*)/)[1];
      output.push(`<${tag}${attribs}>`);
      if (child.outerHTML.includes('</')) {
        // Has a closing tag, so recurse into the children and emit it
        output.push(...parseNode(child));
        output.push(`</${tag}>`);
      }
    }
  }
  return output;
};
console.log(parseNode(doc.body));
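The check for `</` in outerHTML is a heuristic: a void element whose attribute value happens to contain `</` could be misclassified, depending on how the browser serializes it. A stricter option, sketched below, is to test the element name against the list of void elements in the HTML Standard. `VOID_ELEMENTS` and `isVoidElement` are names invented for this example.

```javascript
// Void elements per the HTML Standard: they never have a closing tag.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Hypothetical helper: accepts either element.tagName or element.localName,
// since the name is lowercased before the lookup.
const isVoidElement = name => VOID_ELEMENTS.has(name.toLowerCase());

console.log(isVoidElement('input')); // true
console.log(isVoidElement('INPUT')); // true
console.log(isVoidElement('p'));     // false
```

In the function above, the condition would then become `!isVoidElement(child.tagName)` in place of the outerHTML check.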
Upvotes: 2