mightymax

Reputation: 431

Splitting Regex into Array

I have a string of elements spread over multiple lines (I can put it all on one line if necessary), and I want to split it on the <section> element. I thought this would be easy, just str.split(regex) or even str.split('<section'), but it's not working: it never breaks the sections out.

I've tried using a regular expression: SecRegex = /<section.*?>[\s\S]*?<\/section>/; var fndSection = result.split(SecRegex);

I also tried var fndSection = result.split('<section');

I've looked all over the net and, from what I've found, one of the two methods above should have worked.

result = '

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter id="chap2"> <title>THEORY</title>
<section id="Thoery">
<title>theory Section</title>
<para0 verstatus="ver">
<title>Theory Para 0 </title>
<text>blah blah</text>
</para0>
</section>

<section id="Next section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="More sections">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter> <title>Chapter Title</title>
<section id="Section ID">
<title>Section Title</title>
<para0>
<title>Para0 Title</title>
<para>blah blah</para>
</para0>
</section>

<section id="Next section">
<title>title</title>
<para0>
<line>Title</line>
<text>blah blah</text>
</para0>
</section>

<section id="More sections">
<title>title</title>
<para0>
<list>Title</list>
<text>blah blah</text>
</para0>
</section>

<section id="section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<ipbchap>
<tags></tags>
</ipbchap>

</body>
<rear>
<tags></tags>
</rear>
</doc>'

Code

SecRegex = /<section.*?>[\s\S]*?<\/section>/;
var fndSection = result.split(SecRegex);

console.log("result string " + fndSection);

This is the result I'm getting from the code I have

result string <chapter id="chap2"> <title>THEORY</title> , , , , <chapter id="chap1"> <para0> <title></title></para0> </chapter> 
result string <chapter id="chap1"> <para0> <title></title></para0> </chapter> 
result string <chapter

As you can see, the <section> blocks are dropped from the output instead of being returned.

What I want is each <section>...</section> block as its own element in an array.

Thank you everyone for looking at this and helping me. I appreciate all your help.

Maxine

Upvotes: 1

Views: 108

Answers (3)

zer00ne

Reputation: 43880

Do not use RegEx on HTML (or any cousin of HTML). Collect your <section>s into a NodeList. Convert that NodeList into an Array. Convert each Node into a String. This could be done in one line:

const strings = Array.from(document.querySelectorAll('section')).map(section => section.outerHTML);

The following demo is a breakdown of the example above.

// Collect all <section>s into a NodeList
const sections = document.querySelectorAll('section');

// Convert NodeList into an Array
const array = Array.from(sections);

/*
Iterate through Array -- on each <section>...
convert it into a String
*/
const strings = array.map(section => section.outerHTML);

// View array as a template literal for a cleaner look
console.log(`${strings}`);

// Verifying it's an array of multiple elements
console.log(strings.length);

// Verifying that they are in fact strings
console.log(typeof strings[0]);
<chapter id="chap1">
  <para0>
    <title></title>
  </para0>
</chapter>

<chapter id="chap2">
  <title>THEORY</title>
  <section id="Thoery">
    <title>theory Section</title>
    <para0 verstatus="ver">
      <title>Theory Para 0 </title>
      <text>blah blah</text>
    </para0>
  </section>

  <section id="Next section">
    <title>title</title>
    <para0>
      <title>Title</title>
      <text>blah blah</text>
    </para0>
  </section>

  <section id="More sections">
    <title>title</title>
    <para0>
      <title>Title</title>
      <text>blah blah</text>
    </para0>
  </section>

  <section id="section">
    <title>title</title>
    <para0>
      <title>Title</title>
      <text>blah blah</text>
    </para0>
  </section>

  <chapter id="chap1">
    <para0>
      <title></title>
    </para0>
  </chapter>

  <chapter id="chap1">
    <para0>
      <title></title>
    </para0>
  </chapter>

  <chapter>
    <title>Chapter Title</title>
    <section id="Section ID">
      <title>Section Title</title>
      <para0>
        <title>Para0 Title</title>
        <para>blah blah</para>
      </para0>
    </section>

    <section id="Next section">
      <title>title</title>
      <para0>
        <line>Title</line>
        <text>blah blah</text>
      </para0>
    </section>

    <section id="More sections">
      <title>title</title>
      <para0>
        <list>Title</list>
        <text>blah blah</text>
      </para0>
    </section>

    <section id="section">
      <title>title</title>
      <para0>
        <title>Title</title>
        <text>blah blah</text>
      </para0>
    </section>

    <ipbchap>
      <tags></tags>
    </ipbchap>

Upvotes: 1

Emma

Reputation: 27723

Your expression looks pretty great! You might just want to slightly modify it, maybe to something similar to:

/<section[a-z="'\s]+>([\s\S]*?)<\/section>/gmi

RegEx

If this wasn't your desired expression, you can modify/change your expressions in regex101.com.

RegEx Circuit

You can also visualize your expressions in jex.im:


JavaScript Test

const regex = /<section[a-z="'\s]+>([\s\S]*?)<\/section>/gmi;
const str = `<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter id="chap2"> <title>THEORY</title>
<section id="Thoery">
<title>theory Section</title>
<para0 verstatus="ver">
<title>Theory Para 0 </title>
<text>blah blah</text>
</para0>
</section>

<section id="Next section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="More sections">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>`;
const subst = `$1`;

// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);

console.log('Substitution result: ', result);


In case you might want to capture the section tags as well, you can simply wrap your entire expression in a capturing group:

const regex = /(<section[a-z="'\s]+>([\s\S]*?)<\/section>)/gmi;
const str = `<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter id="chap2"> <title>THEORY</title>
<section id="Thoery">
<title>theory Section</title>
<para0 verstatus="ver">
<title>Theory Para 0 </title>
<text>blah blah</text>
</para0>
</section>

<section id="Next section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="More sections">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>`;
const subst = `\n$1\n`;

// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);

console.log('Substitution result: \n', result);
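If the goal is an array rather than a substituted string, the same expression can probably be passed to String#matchAll instead of replace. The following is only a sketch: the input is a shortened, hypothetical stand-in for the original string, and the names input, sectionRegex, whole and inner are made up for illustration.

```javascript
// Hypothetical shortened input; the regex is the one from this answer.
const input = `<chapter id="chap2"> <title>THEORY</title>
<section id="Thoery">
<title>theory Section</title>
</section>

<section id="Next section">
<title>title</title>
</section>`;

const sectionRegex = /<section[a-z="'\s]+>([\s\S]*?)<\/section>/gi;

// matchAll() requires the g flag and yields one match object per section.
const matches = [...input.matchAll(sectionRegex)];

// m[0] is the whole <section>...</section> block, m[1] is the first capture group.
const whole = matches.map(m => m[0]);
const inner = matches.map(m => m[1].trim());

console.log(whole.length);
console.log(inner);
```

One thing to watch: the character class [a-z="'\s] does not include digits or hyphens, so an opening tag such as <section id="sec1"> would not match this expression as written.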

Upvotes: 2

VLAZ

Reputation: 28970

You don't need to split the string - you want to extract the data that matches your pattern from it. You can do that using String#match. Note that you need to add the g flag to get all matches:

var result = `<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter id="chap2"> <title>THEORY</title>
<section id="Thoery">
<title>theory Section</title>
<para0 verstatus="ver">
<title>Theory Para 0 </title>
<text>blah blah</text>
</para0>
</section>

<section id="Next section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="More sections">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter> <title>Chapter Title</title>
<section id="Section ID">
<title>Section Title</title>
<para0>
<title>Para0 Title</title>
<para>blah blah</para>
</para0>
</section>

<section id="Next section">
<title>title</title>
<para0>
<line>Title</line>
<text>blah blah</text>
</para0>
</section>

<section id="More sections">
<title>title</title>
<para0>
<list>Title</list>
<text>blah blah</text>
</para0>
</section>

<section id="section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<ipbchap>
<tags></tags>
</ipbchap>

</body>
<rear>
<tags></tags>
</rear>
</doc>`;
// the g flag is added ---------------------↓
SecRegex = /<section.*?>[\s\S]*?<\/section>/g;
var fndSection = result.match(SecRegex);


console.log("result string ", fndSection);
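To make the difference between the two methods concrete, here is a small side-by-side sketch. It uses a shortened, hypothetical sample instead of the full string, and the names sample, pattern, leftovers, firstOnly and allSections are only for illustration.

```javascript
// Shortened stand-in for the asker's `result` string (hypothetical sample).
const sample = `<chapter id="chap2"> <title>THEORY</title>
<section id="a">
<title>one</title>
</section>

<section id="b">
<title>two</title>
</section>
</chapter>`;

const pattern = /<section.*?>[\s\S]*?<\/section>/;

// split() treats every match as a separator and returns the text BETWEEN the matches,
// so the sections themselves are thrown away.
const leftovers = sample.split(pattern);

// Without the g flag, match() returns only the first match (plus match details).
const firstOnly = sample.match(pattern);

// With the g flag, match() returns every matched substring as a plain array.
const allSections = sample.match(new RegExp(pattern.source, "g"));

console.log(leftovers);
console.log(firstOnly.length);
console.log(allSections);
```

This is why the output in the question contains only the chapter markup and a run of commas: those are the pieces left over around the removed sections, joined back together by the string concatenation in console.log.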

However, you are better off parsing the DOM and extracting the information you want from there - this is simple using DOMParser:

var result = `<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter id="chap2"> <title>THEORY</title>
<section id="Thoery">
<title>theory Section</title>
<para0 verstatus="ver">
<title>Theory Para 0 </title>
<text>blah blah</text>
</para0>
</section>

<section id="Next section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="More sections">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<section id="section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter id="chap1">
<para0><title></title></para0>
</chapter>

<chapter> <title>Chapter Title</title>
<section id="Section ID">
<title>Section Title</title>
<para0>
<title>Para0 Title</title>
<para>blah blah</para>
</para0>
</section>

<section id="Next section">
<title>title</title>
<para0>
<line>Title</line>
<text>blah blah</text>
</para0>
</section>

<section id="More sections">
<title>title</title>
<para0>
<list>Title</list>
<text>blah blah</text>
</para0>
</section>

<section id="section">
<title>title</title>
<para0>
<title>Title</title>
<text>blah blah</text>
</para0>
</section>

<ipbchap>
<tags></tags>
</ipbchap>

</body>
<rear>
<tags></tags>
</rear>
</doc>`

var parser = new DOMParser();
var doc = parser.parseFromString(result, "text/html");

var sections = [...doc.getElementsByTagName("section")];
var fndSection = sections.map(section => section.outerHTML)
console.log(fndSection);

Upvotes: 1
