noob

Reputation: 9202

Match everything but not quoted strings

I want to match everything except quoted strings.

I can match all quoted strings with this: /(("([^"\\]|\\.)*")|('([^'\\]|\\.)*'))/ So I tried to match everything except quoted strings with this: /[^(("([^"\\]|\\.)*")|('([^'\\]|\\.)*'))]/ but it doesn't work.

I would like to use only a regex, because I want to replace the matches and keep the quoted text intact afterwards.

string.replace(regex, function(a, b, c) {
   // return after a lot of operations
});

For me, a quoted string is something like "bad string" or 'cool string'.

So if I input:

he\'re is "watever o\"k" efre 'dder\'4rdr'?

It should output these matches:

["he\'re is ", " efre ", "?"]

And then I want to replace them.

I know my question is very difficult but it is not impossible! Nothing is impossible.

Thanks

Upvotes: 1

Views: 582

Answers (3)

Bergi

Reputation: 664425

You can't invert a regex. What you tried was to turn it into a character class and negate that, but even for that you would have to escape every closing bracket as "\]".
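To illustrate why the negated class does not behave like an inverted pattern, here is a minimal sketch. It uses a simplified stand-in pattern rather than the regex from the question, and the variable names are only for illustration:

```javascript
// Inside [...], parentheses and quotes are ordinary characters, so this class
// simply excludes the four characters ( " a ) and nothing more.
var negatedClass = /[^("a")]/g;

// Each match is ONE character that is not in the excluded set.
var leftovers = 'x"a"y(z)'.match(negatedClass);
console.log(leftovers); // ["x", "y", "z"]

// The grouped pattern, by contrast, matches the whole quoted token.
var quotedToken = 'x"a"y(z)'.match(/("a")/);
console.log(quotedToken[1]); // "a" including its quotes
```

So the class never "knows" about the quoted string as a unit; it only filters single characters.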

EDIT: I would have started with

/(^|" |' ).+?($| "| ')/

This matches anything between the beginning of the string or the end of a quoted string (very simple: a quotation mark plus a blank) and the end of the string or the start of a quoted string (a blank plus a quotation mark). Of course this doesn't handle escape sequences, or quotations that don't follow the scheme / ['"].*['"] /. See the other answers for more detailed expressions :-)

Upvotes: -4

Tim Pietzcker

Reputation: 336128

EDIT: Rewritten to cover more edge cases.

This can be done, but it's a bit complicated.

result = subject.match(/(?:(?=(?:(?:\\.|"(?:\\.|[^"\\])*"|[^\\'"])*'(?:\\.|"(?:\\.|[^"'\\])*"|[^\\'])*')*(?:\\.|"(?:\\.|[^"\\])*"|[^\\'])*$)(?=(?:(?:\\.|'(?:\\.|[^'\\])*'|[^\\'"])*"(?:\\.|'(?:\\.|[^'"\\])*'|[^\\"])*")*(?:\\.|'(?:\\.|[^'\\])*'|[^\\"])*$)(?:\\.|[^\\'"]))+/g);

will return

, he said. 
, she replied. 
, he reminded her. 
, 

from this string (line breaks added and enclosing quotes removed for clarity):

"Hello", he said. "What's up, \"doc\"?", she replied. 
'I need a 12" crash cymbal', he reminded her. 
"2\" by 4 inches", 'Back\"\'slashes \\ are OK!'

Explanation: (sort of, it's a bit mindboggling)

Breaking up the regex:

(?:
 (?=      # Assert even number of (relevant) single quotes, looking ahead:
  (?:
   (?:\\.|"(?:\\.|[^"\\])*"|[^\\'"])*
   '
   (?:\\.|"(?:\\.|[^"'\\])*"|[^\\'])*
   '
  )*
  (?:\\.|"(?:\\.|[^"\\])*"|[^\\'])*
  $
 )
 (?=      # Assert even number of (relevant) double quotes, looking ahead:
  (?:
   (?:\\.|'(?:\\.|[^'\\])*'|[^\\'"])*
   "
   (?:\\.|'(?:\\.|[^'"\\])*'|[^\\"])*
   "
  )*
  (?:\\.|'(?:\\.|[^'\\])*'|[^\\"])*
  $
 )
 (?:\\.|[^\\'"]) # Match text between quoted sections
)+

First, you can see that there are two similar parts. Both these lookahead assertions ensure that there is an even number of single/double quotes in the string ahead, disregarding escaped quotes and quotes of the opposite kind. I'll show it with the single quotes part:

(?=                   # Assert that the following can be matched:
 (?:                  # Match this group:
  (?:                 #  Match either:
   \\.                #  an escaped character
  |                   #  or
   "(?:\\.|[^"\\])*"  #  a double-quoted string
  |                   #  or
   [^\\'"]            #  any character except backslashes or quotes
  )*                  # any number of times.
  '                   # Then match a single quote
  (?:\\.|"(?:\\.|[^"'\\])*"|[^\\'])*'   # Repeat once to ensure even number,
                      # (but don't allow single quotes within nested double-quoted strings)
 )*                   # Repeat any number of times including zero
 (?:\\.|"(?:\\.|[^"\\])*"|[^\\'])*      # Then match the same until...
 $                    # ... end of string.
)                     # End of lookahead assertion.

The double quotes part works the same.

Then, at each position in the string where these two assertions succeed, the next part of the regex actually tries to match something:

(?:      # Match either
 \\.     # an escaped character
|        # or
 [^\\'"] # any character except backslash, single or double quote
)        # End of non-capturing group

The whole thing is repeated once or more, as many times as possible. The /g modifier makes sure we get all matches in the string.
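The "even number of quotes ahead" idea is easier to see in a stripped-down form. The following is only a sketch under strong assumptions: it handles double quotes only and ignores backslash escapes entirely, so it is not a replacement for the full regex above. The names are hypothetical:

```javascript
// Match a run of non-quote characters, but only if an even number of
// double quotes remains between the end of the run and the end of the string.
// An even count ahead means the run is outside any quoted section.
var outsideQuotes = /[^"]+(?=(?:[^"]*"[^"]*")*[^"]*$)/g;

var sample = 'say "hi" and "bye" now';
var parts = sample.match(outsideQuotes);
console.log(parts); // ["say ", " and ", " now"]
```

Text inside a quoted section always has an odd number of quotes after it, so the lookahead fails there and those characters are skipped.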

See it in action here on RegExr.

Upvotes: 9

ridgerunner

Reputation: 34395

Here is a tested function that does the trick:

function getArrayOfNonQuotedSubstrings(text) {
    /*  Regex with three global alternatives to section the string:
          ('[^'\\]*(?:\\[\S\s][^'\\]*)*')  # $1: Single quoted string.
        | ("[^"\\]*(?:\\[\S\s][^"\\]*)*")  # $2: Double quoted string.
        | ([^'"\\]*(?:\\[\S\s][^'"\\]*)*)  # $3: Un-quoted string.
    */
    var re = /('[^'\\]*(?:\\[\S\s][^'\\]*)*')|("[^"\\]*(?:\\[\S\s][^"\\]*)*")|([^'"\\]*(?:\\[\S\s][^'"\\]*)*)/g;
    var a = [];                 // Empty array to receive the goods;
    text = text.replace(re,     // "Walk" the text chunk-by-chunk.
        function(m0, m1, m2, m3) {
            if (m3) a.push(m3); // Push non-quoted stuff into array.
            return m0;          // Return this chunk unchanged.
        });
    return a;
}

This solution uses the String.replace() method with a replacement callback function to "walk" the string section by section. The regex has three global alternatives, one for each kind of section: $1 for single-quoted, $2 for double-quoted, and $3 for non-quoted substrings. Each non-quoted chunk is pushed onto the return array. It correctly handles all escaped characters, including escaped quotes, both inside and outside quoted strings. Single-quoted substrings may contain any number of double quotes and vice versa. Illegal orphan quotes are left out of the result and serve to divide a non-quoted section into two chunks. Note that this solution requires no lookaround and only one pass. It also implements Friedl's "Unrolling-the-Loop" technique and is quite efficient.

Additional: Here is some code to test the function with the original test string:

// The original test string (with necessary escapes):
var s = "he\\'re is \"watever o\\\"k\" efre 'dder\\'4rdr'?";
alert(s); // Show the test string without the extra backslashes.
console.log(getArrayOfNonQuotedSubstrings(s).toString());
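Since the question ultimately asks to replace the non-quoted parts and keep the quoted ones, the same walk can do that directly. This is a sketch, not part of the tested function above: replaceOutsideQuotes and its transform parameter are hypothetical names, and upper-casing is just a placeholder operation.

```javascript
// Apply `transform` to every non-quoted chunk; quoted chunks pass through unchanged.
function replaceOutsideQuotes(text, transform) {
    // Same three alternatives as above: $1 single quoted, $2 double quoted, $3 un-quoted.
    var re = /('[^'\\]*(?:\\[\S\s][^'\\]*)*')|("[^"\\]*(?:\\[\S\s][^"\\]*)*")|([^'"\\]*(?:\\[\S\s][^'"\\]*)*)/g;
    return text.replace(re, function (m0, m1, m2, m3) {
        // m3 is set (and non-empty) only for un-quoted text.
        return m3 ? transform(m3) : m0;
    });
}

// The original test string (with necessary escapes):
var input = "he\\'re is \"watever o\\\"k\" efre 'dder\\'4rdr'?";
var output = replaceOutsideQuotes(input, function (chunk) {
    return chunk.toUpperCase();
});
console.log(output); // HE\'RE IS "watever o\"k" EFRE 'dder\'4rdr'?
```

Returning m0 for the quoted alternatives is what keeps the quoted text in place, so no second pass is needed to put it back.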

Upvotes: 1
