jlo

Reputation: 2259

Regex match all words except those between quotes

In this example I want to select all words, except those between quotes (i.e. "results", "items", "packages", "settings" and "build_type", but not "compiler.version").

results[0].items[0].packages[0].settings["compiler.version"] 
results[0].items[0].packages[0].settings.build_type

Here's what I know: I can target all words with

[a-z_]+

and then target what's in between quotes with this:

(?<=\")[\w.]+(?=\")

Is there any way to match the difference between the results of the first and second regex? (i.e. words except if they are surrounded by double quotes)

Here's a regex playground with the example for convenience.

Upvotes: 3

Views: 1463

Answers (3)

user17038038

Reputation: 136

Here is a simpler version which works with the example you provided.

(?<!\")\b[a-z_]+\b(?!\")

Here's a demo

Edit: This does work for the example you provided. However, it has a flaw: it only rejects words that directly touch a ". If the quotes contain several words, any inner word that is not adjacent to a " will still match.

Working on improving this solution and will edit this post if new updates develop.
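The limitation described in the edit can be reproduced with a short Python sketch. The string `settings["one two three"]` is a made-up input used only for illustration; it does not come from the question.

```python
import re

# Pattern from this answer; escaping the quotes is optional in Python.
pattern = re.compile(r'(?<!")\b[a-z_]+\b(?!")')

# Line from the question: both quoted words touch a quote, so neither matches.
question_line = 'results[0].items[0].packages[0].settings["compiler.version"]'
from_question = pattern.findall(question_line)

# Hypothetical input with three words inside the quotes: the middle word
# touches no quote, so it slips through.
three_words = 'settings["one two three"]'
leaked = pattern.findall(three_words)

print(from_question)  # ['results', 'items', 'packages', 'settings']
print(leaked)         # ['settings', 'two']
```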

Upvotes: -1

Cary Swoveland

Reputation: 110675

A word is not within a double-quoted substring if and only if it is followed in the string by an even number of double quotes (assuming the string is properly formatted and therefore contains an even number of double quotes). You can use the following regular expression to match words that are not contained within double-quoted substrings.

[a-z_]+(?=(?:(?:[^\"\n]*\"){2})*[^\"\n]*$)

Demo

The regular expression can be broken down as follows (alternatively, hover the cursor over each part of the expression at the link to obtain an explanation of its function).

[a-z_]+         # match one or more of the indicated characters
(?=             # begin a positive lookahead
  (?:           # begin an outer non-capture group
    (?:         # begin an inner non-capture group
      [^\"\n]*  # match zero or more characters other than " and \n 
      \"        # match "
    ){2}        # end inner non-capture group and execute twice
  )*            # end outer non-capture group and execute zero or more times
  [^\"\n]*      # match zero or more characters other than " and \n 
  $             # match end of string
)               # end positive lookahead
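As a rough check in Python, the sketch below runs the pattern over both lines from the question and compares the result with a direct count of the quotes that follow each word. It assumes the `re.MULTILINE` flag, so that `$` matches at the end of each line rather than only at the end of the whole string. The helper `words_by_quote_parity` is a hypothetical name introduced here for illustration.

```python
import re

# Pattern from this answer, with the optional quote escapes dropped.
# re.MULTILINE is an assumption: it lets $ match at the end of every line.
PATTERN = re.compile(r'[a-z_]+(?=(?:(?:[^"\n]*"){2})*[^"\n]*$)', re.MULTILINE)

text = (
    'results[0].items[0].packages[0].settings["compiler.version"] \n'
    'results[0].items[0].packages[0].settings.build_type'
)

by_regex = PATTERN.findall(text)


def words_by_quote_parity(line):
    """Keep words followed by an even number of double quotes on the line."""
    return [m.group() for m in re.finditer(r'[a-z_]+', line)
            if line.count('"', m.end()) % 2 == 0]


by_counting = [w for line in text.splitlines()
               for w in words_by_quote_parity(line)]

print(by_regex)
print(by_regex == by_counting)  # True
```

Without `re.MULTILINE`, the lookahead cannot reach the end of the string from the first line, because `[^"\n]` never consumes the newline, so only words on the last line would match.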

Upvotes: 3

Wiktor Stribiżew

Reputation: 626758

You can match strings between double quotes and then match and capture words optionally followed by dot-separated words:

list(filter(None, re.findall(r'"[^"]*"|([a-z_]\w*(?:\.[a-z_]\w*)*)', text, re.ASCII | re.I)))

See the regex demo. Details:

  • "[^"]*" - a " char, zero or more chars other than " and then a " char
  • | - or
  • ([a-z_]\w*(?:\.[a-z_]\w*)*) - Group 1: a letter or underscore followed by zero or more word chars, and then zero or more sequences of a . and a letter or underscore followed by zero or more word chars.

See the Python demo:

import re
text = 'results[0].items[0].packages[0].settings["compiler.version"] '
print(list(filter(None, re.findall(r'"[^"]*"|([a-z_]\w*(?:\.[a-z_]\w*)*)', text, re.ASCII | re.I))))
# => ['results', 'items', 'packages', 'settings']
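
One behavior worth checking: because Group 1 allows dot-separated continuations, the same pattern applied to the second line of the question returns `settings.build_type` as a single item. If separate words are wanted, one possible variant drops the dotted part. This variant and the helper `captured` are assumptions introduced here, not part of the answer above.

```python
import re

line1 = 'results[0].items[0].packages[0].settings["compiler.version"] '
line2 = 'results[0].items[0].packages[0].settings.build_type'

original = r'"[^"]*"|([a-z_]\w*(?:\.[a-z_]\w*)*)'
# Hypothetical variant: same idea, but Group 1 stops at the first dot.
variant = r'"[^"]*"|([a-z_]\w*)'


def captured(pattern, text):
    """Return the non-empty Group 1 captures, as in the demo above."""
    return list(filter(None, re.findall(pattern, text, re.ASCII | re.I)))


print(captured(original, line2))
# ['results', 'items', 'packages', 'settings.build_type']
print(captured(variant, line2))
# ['results', 'items', 'packages', 'settings', 'build_type']
print(captured(variant, line1))
# ['results', 'items', 'packages', 'settings']
```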

The re.ASCII option is used to make \w match [a-zA-Z0-9_] without accounting for Unicode chars.
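
A small sketch of what that flag changes, using a made-up word with a non-ASCII letter (the sample string is an assumption for illustration only):

```python
import re

word = 'na\u00efve'  # "naive" spelled with an i-diaeresis (U+00EF)

# Default str patterns are Unicode-aware, so \w accepts the accented letter.
unicode_words = re.findall(r'\w+', word)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the word splits in two.
ascii_words = re.findall(r'\w+', word, re.ASCII)

print(unicode_words)  # ['naïve']
print(ascii_words)    # ['na', 've']
```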

Upvotes: 4
