Jo Au

Reputation: 23

With JavaScript: What is the best way to block scripting without blocking HTML markup (<b>, <p>, etc.)?

I want to safely display text coming from the user (by blocking script tags), but I need to accept HTML markup (b, p, li, ul, etc.).

It needs to be bulletproof against cross-site scripting attacks.

Thank you!

Upvotes: 2

Views: 426

Answers (2)

Mike Samuel

Reputation: 120516

If you have a simple tag whitelist and you don't need to worry about attacks at or below the encoding level (as is the case from within browser-side JavaScript), you can do the following:

function sanitize(tagWhitelist, html) {
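  // Get rid of all uses of '[' by encoding them as entities, so user text
  // cannot collide with the [index] markers introduced below.
  html = String(html).replace(/\[/g, '&#91;');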

  // Consider all uses of '<' and replace whitelisted tags with markers like
  // [1] which are indices into a list of approved tag names.
  // Replace all other uses of < and > with entities.
  var tags = [];
  html = html.replace(
    // The i flag lets upper-case tags such as <B> match; names are lower-cased below.
    /<!--[\s\S]*?-->|<(\/?)([a-z]\w*)(?:[^"'>]|"[^"]*"|'[^']*')*>/gi,
    function (_, close, tagName) {
      if (tagName) {
        tagName = tagName.toLowerCase();
        if (tagWhitelist.hasOwnProperty(tagName) && tagWhitelist[tagName]) {
          var index = tags.length;
          tags.push('<' + (close || '') + tagName + '>');
          return '[' + index + ']';
        }
      }
      return '';
    });

  // Escape HTML special characters.  Leave entities alone.
  html = html.replace(/[<>"'@\`\u0000]/g,
    function (c) {
      switch (c) {
        case '<': return '&lt;';
        case '>': return '&gt;';
        case '"': return '&quot;';
        case '\'': return '&#39;';
        case '@': return '&#64;';
      }
      return '&#' + c.charCodeAt(0) + ';';
    });
  if (html.indexOf('<') >= 0) { throw new Error(); }  // Sanity check.

  // Throw out any close tags that don't correspond to start tags.
  // If <table> is used for formatting, embedded HTML shouldn't be able
  // to use a mismatched </table> to break page layout.
  var open = [];
  for (var i = 0, n = tags.length; i < n; ++i) {
    var tag = tags[i];
    if (tag.charAt(1) === '/') {
      var idx = open.lastIndexOf(tag);
      if (idx < 0) { tags[i] = ""; }  // Drop close tag.
      else {
        tags[i] = open.slice(idx).reverse().join('');
        open.length = idx;
      }
    } else if (!HTML5_VOID_ELEMENTS.test(tag)) {
      open.push('</' + tag.substring(1));
    }
  }
  // Now html contains no tags or less-than characters that could become
  // part of a tag via a replacement operation and tags only contains
  // approved tags.
  // Reinsert the white-listed tags.
  html = html.replace(
       /\[(\d+)\]/g, function (_, index) { return tags[index]; });

  // Close any still open tags.
  // This prevents unclosed formatting elements like <ol> and <table> from
  // breaking the layout of containing HTML.
  return html + open.reverse().join('');
}

var HTML5_VOID_ELEMENTS = new RegExp(
     '^<(?:area|base|br|col|command|embed|hr|img|input'
     + '|keygen|link|meta|param|source|track|wbr)\\b');

It can be used like this:

sanitize({ p: true, b: true, i: true, br: true },
         "Hello, <b>World</b>!<script>alert(1337)<\/script>");
// returns "Hello, <b>World</b>!alert(1337)"

If you need more configurability, like the ability to allow attributes on tags, see the Caja HTML sanitizer.

As others have pointed out, your server should not trust the result coming from the client, so you should re-sanitize on the server before embedding the result into server-generated markup.
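
To see what the close-tag handling inside sanitize does on its own, here is a standalone sketch of that loop. The names balanceTags and VOID_ELEMENTS are hypothetical and exist only for this illustration; the void-element list is a shortened assumption, not a complete one.

```javascript
// Illustration only: a standalone copy of the balancing loop from sanitize.
// balanceTags and VOID_ELEMENTS are hypothetical names for this sketch.
var VOID_ELEMENTS =
    /^<(?:area|base|br|col|embed|hr|img|input|link|meta|param|source|track|wbr)\b/;

function balanceTags(tags) {
  var out = tags.slice();  // Work on a copy of the approved tag list.
  var open = [];           // Close tags still owed, innermost last.
  for (var i = 0; i < out.length; ++i) {
    var tag = out[i];
    if (tag.charAt(1) === '/') {
      var idx = open.lastIndexOf(tag);
      if (idx < 0) {
        out[i] = '';  // No matching start tag: drop the close tag.
      } else {
        // Close everything opened since the matching start tag.
        out[i] = open.slice(idx).reverse().join('');
        open.length = idx;
      }
    } else if (!VOID_ELEMENTS.test(tag)) {
      open.push('</' + tag.substring(1));
    }
  }
  // trailer holds the close tags that must be appended at the end.
  return { tags: out, trailer: open.reverse().join('') };
}

var result = balanceTags(['<b>', '<i>', '</b>', '</i>', '<p>']);
// result.tags    is ['<b>', '<i>', '</i></b>', '', '<p>']
// result.trailer is '</p>'
```

The mismatched </b> closes the inner <i> first, the now-orphaned </i> is dropped, and the unclosed <p> gets its close tag in the trailer.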

Upvotes: 1

J V

Reputation: 11936

If you sanitize user input with client-side JavaScript, it won't be bulletproof no matter what you do.

Assuming you're writing a server-side backend, you should use the tried-and-true BBCode; there must be a library for it.
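
As a rough illustration of the BBCode idea, here is a minimal sketch in JavaScript (it could run under Node on a server). escapeHtml and renderBBCode are hypothetical names, not taken from any library, and the sketch assumes only the attribute-free tags [b], [i], and [u] are allowed. A real library would handle far more.

```javascript
// Hypothetical helpers for illustration; not from any BBCode library.
function escapeHtml(text) {
  var entities = {
    '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;'
  };
  return String(text).replace(/[&<>"']/g, function (c) {
    return entities[c];
  });
}

function renderBBCode(text) {
  // Escape everything first so no user-supplied HTML survives.
  var html = escapeHtml(text);
  // Then translate a fixed set of attribute-free BBCode tags into HTML.
  return html.replace(/\[(\/?)(b|i|u)\]/gi, function (_, close, name) {
    return '<' + close + name.toLowerCase() + '>';
  });
}

var out = renderBBCode('[b]Hi[/b] <script>alert(1)</script>');
// out is '<b>Hi</b> &lt;script&gt;alert(1)&lt;/script&gt;'
```

Because the user never writes raw HTML, anything that looks like a tag is escaped, and only the fixed BBCode tags turn into markup. This sketch does not balance unclosed tags, so that would still need handling.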

Upvotes: 1
