With JavaScript: what is the best way to block scripting without blocking HTML markup (<b>, <p>, etc.)?

Question

I want to safely display text that comes from the user by blocking script tags, but I still need to accept HTML markup (b, p, li, ul, etc.).

It needs to be bulletproof against cross-site scripting attacks.

Thank you!

Mike Samuel · Accepted Answer

If you have a simple tag whitelist and you don't need to worry about attacks at or below the encoding level (as is the case from within browser-side JavaScript), you can do the following:

function sanitize(tagWhitelist, html) {
  // Get rid of all uses of '[' so that user text cannot forge a marker.
  html = String(html).replace(/\[/g, '&#91;');

  // Consider all uses of '<' and replace whitelisted tags with markers like
  // [1] which are indices into a list of approved tag names.
  // Replace all other uses of < and > with entities.
  var tags = [];
  html = html.replace(
    /<!--[\s\S]*?-->|<(\/?)([a-z]\w*)(?:[^"'>]|"[^"]*"|'[^']*')*>/g,
    function (_, close, tagName) {
      if (tagName) {
        tagName = tagName.toLowerCase();
        if (tagWhitelist.hasOwnProperty(tagName) && tagWhitelist[tagName]) {
          var index = tags.length;
          tags.push('<' + (close || '') + tagName + '>');
          return '[' + index + ']';
        }
      }
      return '';
    });

  // Escape HTML special characters.  Leave entities alone.
  html = html.replace(/[<>"'@\`\u0000]/g,
    function (c) {
      switch (c) {
        case '<': return '&lt;';
        case '>': return '&gt;';
        case '"': return '&quot;';
        case '\'': return '&#39;';
        case '@': return '&#64;';
      }
      return '&#' + c.charCodeAt(0) + ';';
    });
  if (html.indexOf('<') >= 0) { throw new Error(); }  // Sanity check.

  // Throw out any close tags that don't correspond to start tags.
  // If <table> is used for formatting, embedded HTML shouldn't be able
  // to use a mismatched </table> to break page layout.
  var open = [];
  for (var i = 0, n = tags.length; i < n; ++i) {
    var tag = tags[i];
    if (tag.charAt(1) === '/') {
      var idx = open.lastIndexOf(tag);
      if (idx < 0) { tags[i] = ""; }  // Drop close tag.
      else {
        tags[i] = open.slice(idx).reverse().join('');
        open.length = idx;
      }
    } else if (!HTML5_VOID_ELEMENTS.test(tag)) {
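      open.push('</' + tag.substring(1));
    }
  }

  // At this point html contains no '<' that could become part of a tag,
  // and tags contains only approved tags. Reinsert the whitelisted tags.
  html = html.replace(
    /\[(\d+)\]/g, function (_, index) { return tags[index]; });

  // Close any tags that are still open. This prevents unclosed tags such as
  // <b> and <i> from
  // breaking the layout of containing HTML.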
  return html + open.reverse().join('');
}

var HTML5_VOID_ELEMENTS = new RegExp(
     '^<(?:area|base|br|col|command|embed|hr|img|input'
     + '|keygen|link|meta|param|source|track|wbr)\\b');

It can be used like this:

sanitize({ p: true, b: true, i: true, br: true },
         "Hello, World!");
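
The close-tag bookkeeping is the least obvious step of the function, so here is a minimal sketch of it in isolation. `balanceTags` is a hypothetical helper, not part of the answer's code; it assumes its input is the list of bare, already-approved tags that `sanitize` collects, and it reuses the same void-element list.

```javascript
// Hypothetical helper: the close-tag bookkeeping pulled out of sanitize()
// so it can be run on its own. Input is a list of bare tags like '<b>', '</b>'.
var VOID_TAG = /^<(?:area|base|br|col|command|embed|hr|img|input|keygen|link|meta|param|source|track|wbr)\b/;

function balanceTags(tags) {
  var out = tags.slice();
  var open = [];  // Close tags owed for the elements that are currently open.
  for (var i = 0; i < out.length; ++i) {
    var tag = out[i];
    if (tag.charAt(1) === '/') {
      var idx = open.lastIndexOf(tag);
      if (idx < 0) {
        out[i] = '';  // No matching start tag: drop the close tag.
      } else {
        // Close everything opened since the matching start tag, innermost first.
        out[i] = open.slice(idx).reverse().join('');
        open.length = idx;
      }
    } else if (!VOID_TAG.test(tag)) {
      open.push('</' + tag.substring(1));  // Void elements never need closing.
    }
  }
  // Whatever is still open gets closed at the very end of the output.
  return { tags: out, trailer: open.reverse().join('') };
}

var result = balanceTags(['<b>', '<i>', '</b>', '</i>', '<br>', '<p>']);
// result.tags    -> ['<b>', '<i>', '</i></b>', '', '<br>', '<p>']
// result.trailer -> '</p>'
```

The misnested `</b>` is rewritten to close the inner `<i>` first, the now-orphaned `</i>` is dropped, and the unclosed `<p>` is closed by the trailer.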
If you need more configurability, like the ability to allow attributes on tags, see the Caja HTML sanitizer.
As others have pointed out, your server should not trust the result coming from the client, so you should re-sanitize on the server before embedding the result into server-generated markup.
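
Because the function is plain string manipulation, the same kind of filter can run on the server. As a complementary check (not a replacement for re-sanitizing), a server could also reject submissions that do not already have the shape the filter produces. The sketch below assumes Node.js on the server, that the markup will be embedded in element content rather than inside an attribute, and that whitelisted tags carry no attributes; `isSanitizedOutput` is a hypothetical name.

```javascript
// Hypothetical server-side check: returns true only if every '<' or '>' in
// html belongs to a bare whitelisted tag such as <b> or </b>.
function isSanitizedOutput(tagWhitelist, html) {
  var rest = String(html).replace(/<\/?([a-z]\w*)>/g, function (match, tagName) {
    var allowed = Object.prototype.hasOwnProperty.call(tagWhitelist, tagName)
        && tagWhitelist[tagName];
    return allowed ? '' : match;  // Leave unapproved tags in place.
  });
  // Anything left that still contains an angle bracket did not come from
  // the whitelist filter, so the submission should be rejected.
  return !/[<>]/.test(rest);
}

var whitelist = { p: true, b: true, i: true, br: true };
isSanitizedOutput(whitelist, 'Hello, <b>World</b>!');       // true
isSanitizedOutput(whitelist, '<script>alert(1)</script>');   // false
isSanitizedOutput(whitelist, '<b onclick="x()">hi</b>');     // false
isSanitizedOutput(whitelist, '1 &lt; 2');                    // true
```

Rejecting on failure is stricter than silently repairing the markup, which may or may not suit the application.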
