Eoin Murray

Reputation: 1955

Split a string of HTML into an array of strings, one per top-level tag

Say I have

var string = 
"<h1>Header</h1>
<p>this is a small paragraph</p>
<ul>
    <li>list element 1.</li>
    <li>list element 2.</li>
    <li>list element 3. With a small update.</li>
</ul>"
//newlines for clarity only

How can I split this string using JavaScript so that I get

var array = string.split(/*...something here*/)

array = [
"<h1>Header</h1>",
"<p>this is a small paragraph</p>",
"<ul><li>list element 1.</li><li>list element 2.</li><li>list element 3. With a small update.</li></ul>"
]

I only want to split on the top-level HTML elements, not their children.

Upvotes: 2

Views: 9843

Answers (3)

Robert Plummer

Reputation: 652

A performant solution ( http://jsperf.com/spliting-html ):

var splitter = document.createElement('div'),
  text = splitter.innerHTML = "<h1>Header</h1>\
<p>this is a small paragraph</p>\
<ul>\
    <li>list element 1.</li>\
    <li>list element 2.</li>\
    <li>list element 3. With a small update.</li>\
</ul>",
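  // outerHTML keeps each top-level element's own tag; innerHTML would drop it
  parts = Array.prototype.map.call(splitter.children, function (el) {
    return el.outerHTML;
  });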

Upvotes: 2

Alex Shesterov

Reputation: 27525

You can't do this with regular expressions in the general case. A regular expression will fail if you have several nested elements of the same type, e.g.

<div>
  <div>
    <div>
    </div>
  </div>
</div>

This is because regular expressions can only recognize regular languages, whereas HTML is a context-free language (and context-free is "more complex" than regular): matching arbitrarily deep nesting requires keeping track of depth.

See also: https://stackoverflow.com/a/1732454/2170192
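To illustrate what "keeping track of depth" means in practice, here is a minimal sketch that scans the tags and counts nesting levels instead of relying on a single pattern. `splitTopLevel` is a hypothetical helper name, and the sketch assumes well-formed markup with no void or self-closing elements (`<br>`, `<img>`), no comments, and no `>` inside attribute values. Text between top-level elements is dropped.

```javascript
// Hypothetical helper: split an HTML string into its top-level elements
// by counting nesting depth. Assumes well-formed markup without void or
// self-closing elements, comments, or ">" inside attribute values.
function splitTopLevel(html) {
  var tagPattern = /<(\/?)\w+[^>]*>/g; // group 1 is "/" for a closing tag
  var parts = [];
  var depth = 0;
  var start = 0;
  var match;

  while ((match = tagPattern.exec(html)) !== null) {
    if (match[1] === '/') {
      depth -= 1;
      if (depth === 0) {
        // Back at the top level: everything since `start` is one element.
        parts.push(html.slice(start, tagPattern.lastIndex));
      }
    } else {
      if (depth === 0) {
        start = match.index; // a new top-level element begins here
      }
      depth += 1;
    }
  }
  return parts;
}

var nestedParts = splitTopLevel('<div><div><div></div></div></div><p>x</p>');
// nestedParts is ['<div><div><div></div></div></div>', '<p>x</p>']
```

For anything beyond simple, trusted markup, letting the browser parse the string is the safer route.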

But if you don't have nested elements of the same type, you may split your HTML string by taking all matches returned by the following regular expression (which uses a backreference):

/<(\w+).*?<\/\1\s*>/igsm
  • <(\w+) matches the less-than sign and one or more word characters (letters, digits, underscores), while capturing the word characters via parentheses (first capturing group).
  • .*? matches the contents of the element. The quantifier is lazy so that the match stops at the first matching end tag; a greedy .* would merge sibling elements of the same type into one match.
  • <\/ matches the opening of the end tag.
  • \1 is the backreference, which matches exactly the sequence of symbols captured by the first capturing group.
  • \s*> matches optional whitespace and the greater-than sign.
  • igsm are modifiers: case-insensitive, global, dot-matches-all-symbols and multi-line. (The multi-line flag has no effect here because the pattern contains no ^ or $ anchors, and the s flag requires an ES2018-capable engine.)
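As a rough sketch of how the matches might be collected, the pattern can be passed to `String.prototype.match`. The pattern here uses the lazy quantifier `.*?` so that sibling elements of the same type are not merged into one match. The variable names and the shortened sample string are only for illustration.

```javascript
// Lazy quantifier (.*?) stops at the first matching end tag.
// The `s` (dotAll) flag needs an ES2018-capable engine.
var topLevelPattern = /<(\w+).*?<\/\1\s*>/gis;

var html = '<h1>Header</h1>' +
  '<p>this is a small paragraph</p>' +
  '<ul><li>list element 1.</li><li>list element 2.</li></ul>';

// match() with the global flag returns all matches, or null if none.
var parts = html.match(topLevelPattern) || [];
// parts[0] is '<h1>Header</h1>'
// parts[1] is '<p>this is a small paragraph</p>'
// parts[2] is '<ul><li>list element 1.</li><li>list element 2.</li></ul>'

// The limitation described above: nested elements of the same type
// are cut off at the first end tag.
var nested = '<div><div></div></div>'.match(topLevelPattern) || [];
// nested is ['<div><div></div>'], which is not the whole outer element
```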

Upvotes: 1

Blender

Reputation: 298146

You could do something like this:

var string = '<div><p></p></div><h1></h1>';
var elements = $(string).map(function() {
    return $('<div>').append(this).html();  // Effectively the element's `outerHTML`
}).get();  // .get() converts the jQuery object into a plain array

With the HTML from the question as `string`, the result is:

["<h1>Header</h1>", "<p>this is a small paragraph</p>", "<ul>    <li>list element 1.</li>    <li>list element 2.</li>    <li>list element 3. With a small update.</li></ul>"]

Upvotes: 3
