Aadit M Shah

Reputation: 74204

How do I combine these two regular expressions into one?

I'm writing a rudimentary lexer using regular expressions in JavaScript and I have two regular expressions (one for single quoted strings and one for double quoted strings) which I wish to combine into one. These are my two regular expressions (I added the ^ and $ characters for testing purposes):

var singleQuotedString = /^'(?:[^'\\]|\\'|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*'$/gi;
var doubleQuotedString = /^"(?:[^"\\]|\\"|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*"$/gi;

Now I tried to combine them into a single regular expression as follows:

var string = /^(["'])(?:[^\1\\]|\\\1|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*\1$/gi;

However, when I test the input "Hello"World!", it returns true instead of false:

alert(string.test('"Hello"World!"')); //should return false as a double quoted string must escape double quote characters

I figured that the problem is in [^\1\\] which should match any character besides matching group \1 (which is either a single or a double quote - the delimiter of the string) and \\ (which is the backslash character).

The regular expression correctly filters out backslashes and matches the delimiters, but it doesn't filter out the delimiter within the string. Any help will be greatly appreciated. Note that I referred to Crockford's railroad diagrams to write the regular expressions.
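A quick check of that suspicion, as a sketch that assumes a regex without the `u` flag (the variable names are made up for illustration): inside a character class, `\1` does not seem to act as a backreference at all. It is read as a legacy octal escape for the character U+0001, so `[^\1\\]` still accepts a quote.

```javascript
// Assumption: non-Unicode-mode regex (no `u` flag), where legacy octal
// escapes are allowed inside character classes.

// Inside [...], \1 is the character U+0001, not "whatever group 1 matched".
const insideClass = /^[\1]$/;
console.log(insideClass.test('\u0001')); // true
console.log(insideClass.test('"'));      // false

// So [^\1\\] means "anything except U+0001 and backslash",
// which lets an unescaped delimiter through.
const negatedClass = /^(["'])[^\1\\]\1$/;
console.log(negatedClass.test('"""'));   // true
```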

Upvotes: 3

Views: 5213

Answers (3)

user557597

Reputation:

This should work too (raw regex).
If speed is a factor, this is the 'unrolled' method, said to be the fastest for this kind of thing.

(['"])(?:(?!\\|\1).)*(?:\\(?:[\/bfnrt]|u[0-9A-F]{4}|\1)(?:(?!\\|\1).)*)*\1

Expanded

(['"])            # Capture a quote
(?:
   (?!\\|\1).             # As many non-escape and non-quote chars as possible
)*

(?:                       
    \\                     # escape plus,
    (?:
        [\/bfnrt]          # /,b,f,n,r,t or u[0-9A-F]{4} or captured quote
      | u[0-9A-F]{4}
      | \1
    )
    (?:                
        (?!\\|\1).         # As many non-escape and non-quote chars as possible
    )*
)*

\1                # Captured quote
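
As a rough illustration of how the raw pattern above might be used from JavaScript, here is a sketch with the closing backreference written as `\1`. The name `unrolledString`, the `^`/`$` anchors, and the `i` flag are assumptions added for testing; they are not part of the raw regex.

```javascript
// Assumption: anchored with ^...$ so test() checks the whole input,
// as in the question. The name `unrolledString` is hypothetical.
const unrolledString =
  /^(['"])(?:(?!\\|\1).)*(?:\\(?:[\/bfnrt]|u[0-9A-F]{4}|\1)(?:(?!\\|\1).)*)*\1$/i;

console.log(unrolledString.test('"Hello"World!"'));   // false: unescaped delimiter
console.log(unrolledString.test('"Hello\\"World!"')); // true: delimiter is escaped
console.log(unrolledString.test("'tab\\tseparated'")); // true: \t is an allowed escape
```

Note that the escape list in this pattern has no entry for an escaped backslash, so an input such as `"back\\slash"` is rejected. Adding `\\` to the `[\/bfnrt]` class would be one way to cover it.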

Upvotes: 2

Bart Kiers

Reputation: 170158

You can't refer to a matched group inside a character class: (['"])[^\1\\]. Try something like this instead:

(['"])((?!\1|\\).|\\[bnfrt]|\\u[a-fA-F\d]{4}|\\\1)*\1

(you'll need to add some more escapes, but you get my drift...)

A quick explanation:

(['"])             # match a single or double quote and store it in group 1
(                  # start group 2
  (?!\1|\\).       #   if group 1 or a backslash isn't ahead, match any non-line break char
  |                #   OR
  \\[bnfrt]        #   match an escape sequence
  |                #   OR
  \\u[a-fA-F\d]{4} #   match a Unicode escape
  |                #   OR
  \\\1             #   match an escaped quote
)*                 # close group 2 and repeat it zero or more times
\1                 # match whatever group 1 matched
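
Combining this explanation with the escapes listed in the question, a minimal JavaScript sketch might look like the following. The name `quotedString`, the `^`/`$` anchors, the non-capturing group, and the `i` flag are assumptions for illustration; the `g` flag is left out because `test()` on a global regex advances `lastIndex` between calls.

```javascript
// Hypothetical name; the lookahead (?!\1|\\) replaces the invalid [^\1\\].
const quotedString =
  /^(['"])(?:(?!\1|\\).|\\(?:\1|[\\\/bfnrt]|u[0-9a-f]{4}))*\1$/i;

console.log(quotedString.test('"Hello"World!"'));   // false: unescaped delimiter
console.log(quotedString.test('"Hello\\"World!"')); // true: escaped delimiter
console.log(quotedString.test('"it\'s fine"'));     // true: other quote needs no escape
console.log(quotedString.test("'it's broken'"));    // false: unescaped delimiter
```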

Upvotes: 7

hugomg

Reputation: 69934

Well, you can always create a larger regex by applying the alternation operator to the smaller regexes:

/(?:single-quoted-regex)|(?:double-quoted-regex)/

Or explicitly:

var string = /(?:^'(?:[^'\\]|\\'|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*'$)|(?:^"(?:[^"\\]|\\"|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*"$)/gi;

Finally, if you want to avoid the code duplication, you can build up this regex dynamically, using the RegExp constructor.

var quoted_string = function(delimiter){
    return ('^' + delimiter + '(?:[^' + delimiter + '\\]|\\' + delimiter + '|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*' + delimiter + '$').replace(/\\/g, '\\\\');
    // In the general case, consider using a regex-escaping function to avoid backslash hell.
};

var string = new RegExp( '(?:' + quoted_string("'") + ')|(?:' + quoted_string('"') + ')' , 'gi' );
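
For what it's worth, the escaping-function idea could be sketched roughly as below. `escapeRegExp`, `quotedStringSource`, and `anyQuotedString` are hypothetical names, and `String.raw` is used so the backslashes can be written once, as they would appear in a regex literal. The `g` flag is omitted on purpose, since `test()` on a global regex carries `lastIndex` over from one call to the next.

```javascript
// Hypothetical helper: escape regex metacharacters in a literal string.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: pattern source for a string quoted with `delimiter`.
function quotedStringSource(delimiter) {
  const d = escapeRegExp(delimiter);
  // String.raw keeps every backslash exactly as typed.
  return String.raw`${d}(?:[^${d}\\]|\\${d}|\\[\\/bfnrt]|\\u[0-9A-F]{4})*${d}`;
}

// Alternation of the two generated patterns, anchored for whole-input tests.
const anyQuotedString = new RegExp(
  '^(?:' + quotedStringSource("'") + '|' + quotedStringSource('"') + ')$',
  'i'
);

console.log(anyQuotedString.test('"Hello"World!"')); // false
console.log(anyQuotedString.test('"a\\"b"'));        // true
console.log(anyQuotedString.test("'a\\'b'"));        // true
```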

Upvotes: 0
