Zabavsky

Reputation: 13640

Regular expression for excluding some characters with multiline matching

I want to ensure that user input doesn't contain characters like <, > or &#, whether it comes from a text input or a textarea. My pattern:

var pattern = /^((?!&#|<|>).)*$/m;

The problem is that it still matches multiline strings from a textarea, like

this text matches

though this should not, because of this character <

EDIT:

To be clear, I need to exclude only the &# combination, not & or # on their own.

Please suggest a solution. I'd be very grateful.

Upvotes: 1

Views: 6981

Answers (3)

ridgerunner

Reputation: 34395

Alternate answer to specific question:

anubhava's solution works accurately, but it is slow because it must perform a negative lookahead at each and every character position in the string. A simpler approach is to use reverse logic, i.e. instead of verifying that /^((?!&#|<|>)[\s\S])*$/ does match, verify that /[<>]|&#/ does NOT match. To illustrate this, let's create a function, hasSpecial(), which tests whether a string has one of the special chars. Here are two versions; the first uses anubhava's second regex:

function hasSpecial_1(text) {
    // If regex matches, then string does NOT contain special chars.
    return !/^((?!&#|<|>)[\s\S])*$/.test(text);
}
function hasSpecial_2(text) {
    // If regex matches, then string contains (at least) one special char.
    return /[<>]|&#/.test(text);
}

These two functions are functionally equivalent, but the second one is probably quite a bit faster.
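As a quick sanity check (not a benchmark), the two versions can be run against a few sample strings to confirm that they agree. The sample inputs below are made up for illustration, and the functions are repeated in condensed form so the snippet runs on its own.

```javascript
function hasSpecial_1(text) {
    return !/^((?!&#|<|>)[\s\S])*$/.test(text);
}
function hasSpecial_2(text) {
    return /[<>]|&#/.test(text);
}

// Hypothetical sample inputs, including multiline textarea-style values.
var samples = [
    "plain text",            // no special chars
    "a & b # c",             // & and # on their own are allowed
    "line one\nline < two",  // "<" on the second line
    "&#160;",                // contains the &# combination
    ""                       // empty input
];

// Each entry is [result of version 1, result of version 2].
var results = samples.map(function (s) {
    return [hasSpecial_1(s), hasSpecial_2(s)];
});

results.forEach(function (r, i) {
    console.log(JSON.stringify(samples[i]), r[0], r[1]);
});
```

For these inputs both functions report false, false, true, true, false, in that order.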

Note that when I originally read this question, I misinterpreted it as asking to exclude HTML special chars (including HTML entities). If that is what you need, the following solution does just that.

Test if a string contains HTML special Chars:

It appears that the OP wants to ensure a string does not contain any special HTML characters, including <, > and decimal and hex HTML entities such as &#160; and &#xA0;. If this is the case, then the solution should probably also exclude the other (named) type of HTML entities, such as &amp; and &lt;. The solution below excludes all three forms of HTML entities as well as the <> tag delimiters.

Here are two approaches: (Note that both approaches do allow the sequence: &# if it is not part of a valid HTML entity.)

FALSE test using positive regex:

function hasHtmlSpecial_1(text) {
    /* Commented regex:
        # Match string having no special HTML chars.
        ^                  # Anchor to start of string.
        [^<>&]*            # Zero or more non-[<>&] (normal*).
        (?:                # Unroll the loop. ((special normal*)*)
          &                # Allow a & but only if
          (?!              # not an HTML entity (3 valid types).
            (?:            # One from 3 types of HTML entities.
              [a-z\d]+     # either a named entity,
            | \#\d+        # or a decimal entity,
            | \#x[a-f\d]+  # or a hex entity.
            )              # End group of HTML entity types.
            ;              # All entities end with ";".
          )                # End negative lookahead.
          [^<>&]*          # More (normal*).
        )*                 # End unroll the loop.
        $                  # Anchor to end of string.
    */
    var re = /^[^<>&]*(?:&(?!(?:[a-z\d]+|#\d+|#x[a-f\d]+);)[^<>&]*)*$/i;
    // If regex matches, then string does NOT contain HTML special chars.
    return !re.test(text);
}

Note that the above regex utilizes Jeffrey Friedl's "Unrolling-the-Loop" efficiency technique and will run very quickly for both matching and non-matching cases. (See his regex masterpiece: Mastering Regular Expressions (3rd Edition))
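To make the unrolled structure concrete, here is a minimal sketch that compares it with the plain alternation it replaces. The variable name `naive` and the sample strings are illustrative assumptions; only the unrolled regex is taken from the function above. Both patterns should accept exactly the same strings; the unrolled form consumes runs of ordinary characters with [^<>&]* instead of going through the alternation once per character.

```javascript
// Plain form: (normal | special)* with one alternation step per character.
var naive = /^(?:[^<>&]|&(?!(?:[a-z\d]+|#\d+|#x[a-f\d]+);))*$/i;

// Unrolled form: normal* (special normal*)*
var unrolled = /^[^<>&]*(?:&(?!(?:[a-z\d]+|#\d+|#x[a-f\d]+);)[^<>&]*)*$/i;

// Hypothetical inputs; a match means "no HTML special chars".
var inputs = ["Tom & Jerry", "AT&T", "&amp;", "&#160;", "&#xA0;", "&# alone", "a < b", ""];

var agree = inputs.every(function (s) {
    return naive.test(s) === unrolled.test(s);
});

console.log(agree);                         // true
console.log(unrolled.test("Tom & Jerry"));  // true  (bare & is allowed)
console.log(unrolled.test("&amp;"));        // false (named entity)
console.log(unrolled.test("&# alone"));     // true  (&# without a valid entity)
```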

TRUE test using negative regex:

function hasHtmlSpecial_2(text) {
    /* Commented regex:
        # Match string having one special HTML char.
          [<>]           # Either a tag delimiter
        | &              # or a & if start of
          (?:            # one of 3 types of HTML entities.
            [a-z\d]+     # either a named entity,
          | \#\d+        # or a decimal entity,
          | \#x[a-f\d]+  # or a hex entity.
          )              # End group of HTML entity types.
          ;              # All entities end with ";".
    */
    var re = /[<>]|&(?:[a-z\d]+|#\d+|#x[a-f\d]+);/i;
    // If regex matches, then string contains (at least) one special HTML char.
    return re.test(text);
}

Note also that I have included a commented version of each of these (non-trivial) regexes in the form of a JavaScript comment.
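The following sketch shows which inputs the entity-aware test flags. The wrapper name `hasHtmlSpecial` and the test cases are assumptions made for illustration; the regex is the one from hasHtmlSpecial_2, repeated so the snippet runs on its own.

```javascript
// Same regex as hasHtmlSpecial_2, wrapped under a hypothetical name.
function hasHtmlSpecial(text) {
    return /[<>]|&(?:[a-z\d]+|#\d+|#x[a-f\d]+);/i.test(text);
}

// [input, expected result] pairs chosen for illustration.
var cases = [
    ["Tom & Jerry", false],  // bare ampersand
    ["AT&T", false],         // & followed by letters but no ";"
    ["&# alone", false],     // &# that is not part of a valid entity
    ["&amp;", true],         // named entity
    ["&#160;", true],        // decimal entity
    ["&#xA0;", true],        // hex entity
    ["1 < 2", true]          // tag delimiter
];

var allAsExpected = cases.every(function (c) {
    return hasHtmlSpecial(c[0]) === c[1];
});

console.log(allAsExpected); // true
```

Note the difference from the original requirement: this test lets a lone &# through, whereas /[<>]|&#/ rejects it.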

Upvotes: 2

anubhava

Reputation: 785128

You're probably not looking for the m (multiline) switch but for the s (DOTALL) switch. Unfortunately, s was not available in JavaScript before ES2018.

The good news is that DOTALL can be simulated using [\s\S]. Try the following regex:

/^(?![\s\S]*?(&#|<|>))[\s\S]*$/

OR:

/^((?!&#|<|>)[\s\S])*$/

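A small sketch of why the m flag gives a false positive and how the [\s\S] variants behave, using the sample text from the question. The variable names are hypothetical, and the last pattern assumes an engine that supports the ES2018 s (dotAll) flag.

```javascript
// Sample textarea value from the question: the last line contains "<".
var input = "this text matches\n\nthough this should not, because of this character <";

// Original pattern: with the m flag, ^ and $ anchor to each line,
// so the clean first line alone is enough for a match.
var multiline = /^((?!&#|<|>).)*$/m;

// The [\s\S] variants have no m flag, so ^ and $ anchor to the whole string.
var lookaheadOnce = /^(?![\s\S]*?(&#|<|>))[\s\S]*$/;
var lookaheadEach = /^((?!&#|<|>)[\s\S])*$/;

// Assumption: ES2018 or later, where the s flag lets "." match newlines.
var dotAll = /^((?!&#|<|>).)*$/s;

console.log(multiline.test(input));      // true  (false positive)
console.log(lookaheadOnce.test(input));  // false
console.log(lookaheadEach.test(input));  // false
console.log(dotAll.test(input));         // false

// & and # on their own are still accepted, across lines.
console.log(lookaheadEach.test("a & b # c\nsecond line")); // true
```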

Upvotes: 2

Andrew Cheong

Reputation: 30273

I don't think you need a lookaround assertion in this case. Simply use a negated character class:

// No m flag: ^ and $ must anchor to the whole value, not to each line.
var pattern = /^[^<>&#]*$/;

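If you're also disallowing -, [ or ], make sure to escape them or put them in the proper position. In JavaScript, ] has to be escaped, because [^] on its own is a complete class that matches any character; - is literal when it comes last:

var pattern = /^[^\][<>&#-]*$/;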
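A short sketch of the negated-class approach, written under two assumptions that matter in JavaScript: the m flag is left off so that ^ and $ cover the whole value, and ] is escaped because [^] by itself is a complete class matching any character. The variable names and inputs are illustrative only.

```javascript
// Negated class without the m flag; [^...] also matches newlines.
var noSpecial = /^[^<>&#]*$/;

// Extended class: "]" escaped, "-" placed last so it is literal.
var noSpecialExtended = /^[^\][<>&#-]*$/;

// Unescaped variant: parsed as [^] (any char) followed by [<>&#-]*.
var unescaped = /^[^][<>&#-]*$/;

console.log(noSpecial.test("line one\nline two"));    // true
console.log(noSpecial.test("line one\nline < two"));  // false
console.log(noSpecial.test("a & b"));                 // false (lone & is rejected too)

console.log(noSpecialExtended.test("plain text"));    // true
console.log(noSpecialExtended.test("a-b"));           // false
console.log(noSpecialExtended.test("a[0]"));          // false

console.log(unescaped.test("plain text"));            // false (not the intended meaning)
```

Because a character class works on single characters, this approach rejects & and # individually, which is stricter than excluding only the &# combination.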

Upvotes: 2
