Reputation: 13412

Improve JavaScript regex to match content inside of tags with or without closing tag, excluding self

Preface: I'm aware of the general consensus against using regex to parse HTML. Please avoid recommendations along those lines.

Explanations.

I have the following regex

/<div class="panel-body">([^]*?)(<\/div>|$)/gi

It matches all the content inside the div with class .panel-body, including the div tags themselves.

Full match:

<div class="panel-body">
   <a href="#">Link</a>
   Line 1
   Line 2
   Line 3
</div>

It also matches content that has no closing div tag.

Full match:

<div class="panel-body">
   <a href="#">Link</a>
   Line 1
   Line 2
   Line 3
   Don't match after closing `div`...but match this and below in case closing `div` is removed.
   Line below 1
   Line below 2
   Line below 3
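
To make the current behavior concrete, here is a minimal sketch of what the pattern returns for a closed and an unclosed div. The sample strings are shortened stand-ins for the examples above, and the g flag is dropped so that String.prototype.match returns the capture groups as well as the full match.

```javascript
// The pattern from the question, without the g flag so match() exposes groups.
// [^] is a JavaScript idiom that matches any character, newlines included.
const pattern = /<div class="panel-body">([^]*?)(<\/div>|$)/i;

// Shortened, illustrative inputs (not the exact strings from the question).
const closed = '<div class="panel-body">\n   Line 1\n</div>\n   Line below 1';
const open = '<div class="panel-body">\n   Line 1\n   Line below 1';

const closedMatch = closed.match(pattern);
const openMatch = open.match(pattern);

// Index 0 is the full match (tags included); index 1 is the content only.
console.log(JSON.stringify(closedMatch[0]));
console.log(JSON.stringify(closedMatch[1]));
console.log(JSON.stringify(openMatch[1]));
```

With the closed input the lazy group stops at the first closing tag; with the unclosed input it runs to the end of the string, because $ without the m flag only matches at the end of the input.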

Question.

How could I improve my regex to do the following:

Exclude <div class="panel-body"> and the closing </div> (when there is a closing div tag) from the full match
Do this directly in the full match (if possible), without using groups

regex101.com example

Edit 1:

The string doesn't start with <div class="panel-body">, it starts with

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Webmin 1.851 on centos.centos (CentOS Linux 7.3.1611)</title>
</head>
<body>
<div>
<div>
<div class="panel-body">

* Note: The div is never closed until the page has fully loaded, because the output is progressive.

Edit 2:

After the answers were posted, I ran speed comparison tests. It's up to you to decide which solution serves you best.

Speed-test Results

Upvotes: 4

Answers (4)

RoToRa

Reputation: 38431

Does it have to be a regexp? You could just look for the opening tag and optionally drop the closing tag, if present:

function parseContent(input) {
  var openingTag = '<div class="panel-body">';

  var i = input.indexOf(openingTag);
  if (i === -1) {
    return ""; // Or something else
  }

  var closingTag = '</div>';
  var closingTagLength = closingTag.length;
  var end = input.length - (input.slice(-closingTagLength) === closingTag ? closingTagLength : 0);

  return input.slice(i + openingTag.length, end);
}

EDIT:

If there can be text after the closing tag, then just use indexOf there too:

function parseContent(input) {
  var openingTag = '<div class="panel-body">';

  var i = input.indexOf(openingTag);
  if (i === -1) {
    return ""; // Or something else
  }

  var closingTag = '</div>';

  var endIndex = input.indexOf(closingTag, i);
  var end = (endIndex === -1 ? input.length : endIndex);

  return input.slice(i + openingTag.length, end);
}
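
As a usage sketch, the indexOf approach can be re-run each time more of the progressive output arrives. The function below restates the second version above so the snippet runs on its own; the chunks array and the accumulation loop are hypothetical and only stand in for however the output is actually received.

```javascript
// Restatement of the second parseContent above, condensed.
function parseContent(input) {
  const openingTag = '<div class="panel-body">';
  const start = input.indexOf(openingTag);
  if (start === -1) {
    return ""; // Opening tag not received yet
  }
  const endIndex = input.indexOf("</div>", start);
  const end = endIndex === -1 ? input.length : endIndex;
  return input.slice(start + openingTag.length, end);
}

// Hypothetical chunks of progressive output (assumed for illustration).
const chunks = [
  '<!DOCTYPE html>\n<html>\n<body>\n<div>\n<div>\n<div class="panel-body">',
  "\nLine 1",
  "\nLine 2\n</div>\n</div>",
];

let received = "";
const snapshots = [];
for (const chunk of chunks) {
  received += chunk;
  // Re-parse everything received so far.
  snapshots.push(parseContent(received));
}

console.log(snapshots);
```

Note that this stops at the first </div> after the opening tag, so a div nested inside the panel body would end the extracted content early.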

Upvotes: 2

anubhava

Reputation: 785471

You can use a DOM parser, which should work with incomplete tags as well:

function divContent(str) {
  // create a new div container
  var div = document.createElement('div');

  // assign your HTML to div's innerHTML
  div.innerHTML = '<html>' + str + '</html>';

  // find an element by given className
  var el = div.getElementsByClassName("panel-body");
  
  // return the innerHTML of the last matching element
  return (el.length > 0 ? el[el.length-1].innerHTML : "");
}

// extract text from a complete tag:
var html = `<div class="panel-body">
   <a href="#">Link</a>
   Line 1
   Line 2
   Line 3
</div>`;
console.log(divContent(html));

// extract text from an incomplete tag:
html = `<div class="panel-body">
   <a href="#">Link</a>
   Line 1
   Line 2
   Line 3
   Don't match after closing 'div'...but match this and below
   in case closing 'div' is removed.
   Line below 1
   Line below 2
   Line below 3`;   
console.log(divContent(html));

// OP's edited HTML text
html = `<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Webmin 1.851 on centos.centos (CentOS Linux 7.3.1611)</title>
</head>
<body>
<div>
<div>
<div class="panel-body">`;
console.log(divContent(html));

JS Fiddle

Upvotes: 3

Ori Marko

Reputation: 58812

If the content contains no tags, you can match all lines that do not start with a < character:

(^|\r|\n|\r\n)[^<]+

For the specific example, get the first line with

\<[^div] ([^\r\n]*\n)+

If other lines follow, you will need to append the final characters to end the match:

\<[^div] ([^\r\n]*\n)+Line 3

Upvotes: 1

Jordan Maduro

Reputation: 1008

I can't comment yet, so I will try an answer. How about non-capturing groups? The tags are still in the full match, but the only capture group would be the content, so it is the single entry after the full match (group 1).

(?:<div class="panel-body">)([^]*?)(?:<\/div>|$)

https://regex101.com/r/OJf1Rt/3
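
A minimal sketch of how the content could be read from this pattern, next to a lookaround variant that keeps the tags out of the full match altogether. The lookaround pattern is not part of the answer above; it assumes a JavaScript engine with lookbehind support (ES2018), which older browsers lack. The sample input is made up for illustration.

```javascript
// Illustrative input: text before the panel, a closed div, and text after it.
const input = '<body>\n<div class="panel-body">\n   Line 1\n</div>\n   after';

// Non-capturing variant: the full match still includes the tags,
// and the content is the only capture group (index 1).
const grouped = /(?:<div class="panel-body">)([^]*?)(?:<\/div>|$)/i;
const viaGroup = input.match(grouped)[1];

// Lookaround variant (assumes ES2018 lookbehind support): the tags are
// asserted but not consumed, so the full match (index 0) is the content.
const lookaround = /(?<=<div class="panel-body">)[^]*?(?=<\/div>|$)/i;
const viaFullMatch = input.match(lookaround)[0];

// Same lookaround pattern on an unclosed div: matches to the end of input.
const unclosed = '<div class="panel-body">\n   Line 1';
const viaFullMatchUnclosed = unclosed.match(lookaround)[0];

console.log(JSON.stringify(viaGroup));
console.log(JSON.stringify(viaFullMatch));
console.log(JSON.stringify(viaFullMatchUnclosed));
```

Both variants yield the same content for the closed div; they differ only in whether it arrives as group 1 or as the full match.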