Reputation: 14955

Remove HTML tags and newline characters with Regex

I want to replace HTML tags and newline characters with a <br> tag. To do so, I used the following code, but it does not replace \r\n.

const newText = text.replace(/<script.*?<\/script>/g, '<br>')
  .replace(/<style.*?<\/style>/g, '<br>')
  .replace(/(<([^>]+)>)/ig, "<br>")
  .replace(/(?:\r\n|\r|\n)/g, '<br>')

An example of the text

<div class="text-danger ng-binding" ng-bind-html="message.causedBy ">javax.xml.ws.soap.SOAPFaultException: Response was of unexpected text/html ContentType.  Incoming portion of HTML stream: \r\n\r\n\r\n\r\n500 - Internal server error.\r\n\r\n\r\n\r\n<div><h1>Server Error</h1></div>\r\n<div>\r\n <div class="\&quot;content-container\&quot;">\r\n  <h2>500 - Internal server error.</h2>\r\n  <h3>There is a problem with the resource you are looking for, and it cannot be displayed.</h3>\r\n </div>\r\n</div>\r\n\r\n\r\n\n\t</div>

I would appreciate your help. (:

Upvotes: 0

Answers (3)

Steven Spungin

Reputation: 29109

This works for me. Is each part of your CRLF one escaped character ('\r'), or two literal characters ('\' followed by 'r')?

If your HTML elements contain the visible characters \n and \r, they are literal text, which would be really odd inside a div unless you are displaying source code. Plain ol' line breaks are single control characters, and the regex in your question already replaces them.

Also, it's not clear whether your source is pulled from an element or is static text.

If the sequences are literal, you have to escape the backslashes in your regex:

replace(/(?:\\r\\n|\\r|\\n)/g, '<br>')
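To make the difference concrete, here is a minimal sketch that handles both cases. The helper name toBr and the two sample strings are assumptions for illustration, not part of the original code.

```javascript
// Hypothetical helper (the name toBr is an assumption for this sketch).
function toBr(input) {
  return input
    // literal two-character sequences: a backslash followed by r or n
    .replace(/\\r\\n|\\r|\\n/g, '<br>')
    // real control characters: carriage return and line feed
    .replace(/\r\n|\r|\n/g, '<br>');
}

// Real CR + LF: the escape sequences are interpreted by the string literal.
const realBreaks = 'line one\r\nline two';
// Literal backslashes: String.raw keeps \r\n as four visible characters.
const literalBreaks = String.raw`line one\r\nline two`;

console.log(realBreaks.length);    // 18
console.log(literalBreaks.length); // 20
console.log(toBr(realBreaks));     // line one<br>line two
console.log(toBr(literalBreaks));  // line one<br>line two
```

Comparing the two lengths is a quick way to find out which kind of input you have. Note that the literal branch would also rewrite legitimate text that happens to contain a backslash followed by r or n, such as a Windows path, so enable it only if your data really contains escaped sequences.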

const text = `
<div class="text-danger ng-binding" ng-bind-html="message.causedBy ">javax.xml.ws.soap.SOAPFaultException: Response was of unexpected text/html ContentType.  Incoming portion of HTML stream: \r\n\r\n\r\n\r\n500 - Internal server error.\r\n\r\n\r\n\r\n<div><h1>Server Error</h1></div>\r\n<div>\r\n <div class="\&quot;content-container\&quot;">\r\n  <h2>500 - Internal server error.</h2>\r\n  <h3>There is a problem with the resource you are looking for, and it cannot be displayed.</h3>\r\n </div>\r\n</div>\r\n\r\n\r\n\n\t</div>`

const newText = text
  .replace(/<script.*?<\/script>/g, '<br>')
  .replace(/<style.*?<\/style>/g, '<br>')
  .replace(/(<([^>]+)>)/ig, "<br>")
  .replace(/(?:\r\n|\r|\n)/g, '<br>')
  //.replace(/(?:\\r\\n|\\r|\\n)/g, '<br>')
console.log(newText)

const text2 = document.getElementById('text').innerHTML
const newText2 = text2
  .replace(/<script.*?<\/script>/g, '<br>')
  .replace(/<style.*?<\/style>/g, '<br>')
  .replace(/(<([^>]+)>)/ig, "<br>")
  .replace(/(?:\r\n|\r|\n)/g, '<br>')
  //.replace(/(?:\\r\\n|\\r|\\n)/g, '<br>')
console.log(newText2)

<div id='text'>
This

is

<script>// nothing here </script>

a

div

These are literal \r\n\r\n and will not get replaced unless you uncomment the special case.

</div>

Upvotes: 1

Niet the Dark Absol

Reputation: 324690

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML.

And so on.

Instead, you have a parser at your fingertips. Use it!

var tmp = document.createElement('div');
tmp.innerHTML = text;

// replace all start/end tags with <br> for... some reason, I guess!
Array.from(tmp.getElementsByTagName("*")).forEach(function(elem) {
    // ignore <br> tags
    if( elem.nodeName.match(/^br$/i)) {
        // do nothing
    }
    // replace <script> and <style>, contents included, with a single <br>
    else if( elem.nodeName.match(/^(?:script|style)$/i)) {
        elem.parentNode.replaceChild(document.createElement('br'), elem);
    }
    // replace element with its contents and place a <br> before and after
    else {
        elem.parentNode.insertBefore(document.createElement('br'), elem);
        while(elem.firstChild) {
            elem.parentNode.insertBefore(elem.firstChild, elem);
        }
        elem.parentNode.replaceChild(document.createElement('br'), elem);
    }
});

var html = tmp.innerHTML;
// since replacing newlines with <br> is a string operation, go ahead and use regex for that
html = html.replace(/\r\n|\r|\n/g, "<br />");
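One detail of that final string step is worth checking in isolation: String.prototype.replace with a regex that has no g flag replaces only the first match. The sketch below uses a made-up sample string (an assumption for illustration) to show the difference, and also covers a lone \r.

```javascript
// Made-up input containing LF, CRLF, and a lone CR.
const sample = 'a\nb\r\nc\rd';

// Without the g flag, only the first line break is replaced.
const firstOnly = sample.replace(/\r?\n/, '<br>');

// With the g flag and an alternative for a lone \r, every break is replaced.
const allBreaks = sample.replace(/\r\n|\r|\n/g, '<br>');

console.log(JSON.stringify(firstOnly)); // "a<br>b\r\nc\rd"
console.log(JSON.stringify(allBreaks)); // "a<br>b<br>c<br>d"
```

Putting \r\n first in the alternation keeps a CRLF pair from turning into two <br> tags.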

Upvotes: 1

Michał Turczyn

Reputation: 37367

Just replace everything that matches the pattern (<[^>]+>|\r|\n) with an empty string.

It is a simple alternation, where \r is a carriage return and \n is a newline character (so it removes all line breaks, which are sometimes combinations of \r and \n).

<[^>]+> will match every HTML tag, that is, everything from a < up to the next >.
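As a rough sketch of this approach, the snippet below applies the pattern to a shortened piece of the sample from the question. The variable names are assumptions for illustration, and the g flag is added so that every match is removed.

```javascript
// Pattern from the answer, with the g flag so all matches are replaced.
const stripPattern = /<[^>]+>|\r|\n/g;

// Shortened fragment of the question's sample text, with real CR/LF characters.
const htmlSample = '<div>\r\n <h2>500 - Internal server error.</h2>\r\n</div>';

const stripped = htmlSample.replace(stripPattern, '');

console.log(JSON.stringify(stripped)); // " 500 - Internal server error."
```

Two caveats: because matches are replaced with an empty string, text on either side of a removed tag or line break is joined without a separator, and the tag part also matches plain text that merely looks like a tag, such as "a < b and c > d". Replacing with '<br>' instead of '' would give the output the question asks for.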

Upvotes: 0
