Using JavaScript, how do I transform an HTML string into an array of HTML tags and text content?

Question

I have an HTML string such as:


    Lorem Ipsum is simply dummy text of the printing and typesetting industry.

I want to convert this into a JavaScript array that looks like:

['', '', '', 'Lorem Ipsum ', '', '', 'is simply dummy text of the printing ', '', 'and', '', 'typesetting industry.', '']

I.e. it takes the HTML string and breaks it down into an array of tags and HTML content.

I have tried using DOMParser, as suggested in this question:

const str = `Lorem Ipsum is simply dummy text of the printing and typesetting industry.`;

const doc = new DOMParser().parseFromString(str, 'text/html');
const arr = [...doc.body.childNodes]
  .map(child => child.outerHTML || child.textContent);

However, this simply returns:

['Lorem Ipsum is simply dummy text of the printing and typesetting industry.']

I have also searched for various regex-based solutions, but haven't found any that break the string down exactly as I require.

Any suggestions?

Thanks

CertainPerformance · Accepted Answer

I'd make a recursive function to iterate over a given node and return an array of the text representation of its children:

const str = `Lorem Ipsum is simply dummy text of the printing and typesetting industry.`;

const doc = new DOMParser().parseFromString(str, 'text/html');
const parseNode = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === Node.TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === Node.ELEMENT_NODE) {
      output.push(`<${child.tagName}>`);
      output.push(...parseNode(child));
      output.push(`</${child.tagName}>`);
    }
  }
  return output;
};
console.log(parseNode(doc.body));
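
One thing to watch for: in an HTML document, tagName is reported in upper case, so the function above emits tokens such as <DIV> rather than <div>. If lower-case tags are wanted, a minimal sketch along these lines might do. The name tokenizeNode is made up for this example, and the numeric constants stand in for Node.ELEMENT_NODE and Node.TEXT_NODE so that the function only relies on nodeType, tagName, textContent and childNodes:

```javascript
// nodeType values from the DOM standard (same as Node.ELEMENT_NODE / Node.TEXT_NODE).
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helper: same recursive walk as parseNode, but with lower-case tag names.
const tokenizeNode = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === ELEMENT_NODE) {
      const tag = child.tagName.toLowerCase();
      output.push(`<${tag}>`);
      output.push(...tokenizeNode(child));
      output.push(`</${tag}>`);
    }
  }
  return output;
};

// Browser usage (the markup here is only an illustration):
// const doc = new DOMParser().parseFromString('<p>Hello <b>world</b></p>', 'text/html');
// console.log(tokenizeNode(doc.body));
```

Comments and other node types are skipped, exactly as in parseNode.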

If you need to keep attributes too, you could take the outerHTML of the element and capture everything between the tag name and the first closing bracket:

const str = `Lorem Ipsum is simply dummy text of the printing and typesetting industry.`;

const doc = new DOMParser().parseFromString(str, 'text/html');
const parseNode = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === Node.TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === Node.ELEMENT_NODE) {
      const attribs = child.outerHTML.match(/<\s*[^>\s]+([^>]*)/)[1];
      output.push(`<${child.tagName}${attribs}>`);
      output.push(...parseNode(child));
      output.push(`</${child.tagName}>`);
    }
  }
  return output;
};
console.log(parseNode(doc.body));
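
The regular expression above may stop early if an attribute value happens to contain a literal > character. As an alternative (not what the code above does), the opening tag could be rebuilt from the element's attributes collection instead. This is only a sketch: openTag, escapeAttr and tokenizeWithAttributes are names invented for the example, and every attribute is written back with double quotes regardless of how it was quoted in the source:

```javascript
// nodeType values from the DOM standard (same as Node.ELEMENT_NODE / Node.TEXT_NODE).
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Escape the characters that would break a double-quoted attribute value.
const escapeAttr = value => value.replace(/&/g, '&amp;').replace(/"/g, '&quot;');

// Hypothetical helper: rebuild the opening tag from element.attributes.
const openTag = element => {
  const attribs = Array.from(element.attributes)
    .map(({ name, value }) => ` ${name}="${escapeAttr(value)}"`)
    .join('');
  return `<${element.tagName.toLowerCase()}${attribs}>`;
};

// Hypothetical variant of parseNode that keeps attributes without a regex.
const tokenizeWithAttributes = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === ELEMENT_NODE) {
      output.push(openTag(child));
      output.push(...tokenizeWithAttributes(child));
      output.push(`</${child.tagName.toLowerCase()}>`);
    }
  }
  return output;
};

// Browser usage:
// const doc = new DOMParser().parseFromString(str, 'text/html');
// console.log(tokenizeWithAttributes(doc.body));
```

The trade-off is that the original spelling of the tag (quote style, spacing between attributes) is normalized rather than preserved.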

If you need self-closing tags not to be expanded, check whether the outerHTML of the element contains </ and only recurse and emit a closing tag when it does:

const str = `Lorem Ipsum is simply dummy text of the printing and typesetting industry.`;

const doc = new DOMParser().parseFromString(str, 'text/html');
const parseNode = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === Node.TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === Node.ELEMENT_NODE) {
      const attribs = child.outerHTML.match(/<\s*[^>\s]+([^>]*)/)[1];
      output.push(`<${child.tagName}${attribs}>`);
      if (child.outerHTML.includes('</')) {
        output.push(...parseNode(child));
        output.push(`</${child.tagName}>`);
      }
    }
  }
  return output;
};
console.log(parseNode(doc.body));
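
The outerHTML check can be fooled if an attribute value itself contains </. Another option, shown here purely as a sketch, is to compare the tag name against the list of void elements from the HTML standard. VOID_ELEMENTS and tokenizeKeepingVoids are names made up for this example, and attributes are left out to keep it short:

```javascript
// nodeType values from the DOM standard (same as Node.ELEMENT_NODE / Node.TEXT_NODE).
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Void elements per the HTML standard: they never have children or a closing tag.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Hypothetical variant of parseNode that emits no closing tag for void elements.
const tokenizeKeepingVoids = node => {
  const output = [];
  for (const child of node.childNodes) {
    if (child.nodeType === TEXT_NODE) {
      output.push(child.textContent);
    } else if (child.nodeType === ELEMENT_NODE) {
      const tag = child.tagName.toLowerCase();
      output.push(`<${tag}>`);
      if (!VOID_ELEMENTS.has(tag)) {
        output.push(...tokenizeKeepingVoids(child));
        output.push(`</${tag}>`);
      }
    }
  }
  return output;
};

// Browser usage:
// const doc = new DOMParser().parseFromString(str, 'text/html');
// console.log(tokenizeKeepingVoids(doc.body));
```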
