Reputation: 2259
In this example I want to select all words, except those between quotes (i.e. "results", "items", "packages", "settings" and "build_type", but not "compiler.version").
results[0].items[0].packages[0].settings["compiler.version"]
results[0].items[0].packages[0].settings.build_type
Here's what I know: I can target all words with
[a-z_]+
and then target what's in between quotes with this:
(?<=\")[\w.]+(?=\")
Is there any way to match the difference between the results of the first and second regex? (i.e. words except if they are surrounded by double quotes)
Here's a regex playground with the example for convenience.
Upvotes: 3
Views: 1463
Reputation: 136
Here is a simpler version which works with the example you provided.
(?<!\")\b[a-z_]+\b(?!\")
Edit: This does work for the example you provided. However, it has a flaw: it only avoids matching words that are touching a ". Therefore, if you have several words within the quotes, it will match any inner words that are not touching a ".
Working on improving this solution and will edit this post if new updates develop.
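To make the flaw concrete, here is a minimal Python sketch. The second input string is hypothetical (it is not from the question) and was chosen only because it has three dot-separated words inside the quotes.

```python
import re

# Pattern from this answer: a word is rejected only when a double quote
# touches it directly on the left or on the right.
pattern = re.compile(r'(?<!")\b[a-z_]+\b(?!")')

# The question's first example: both quoted words touch a quote.
single = 'results[0].items[0].packages[0].settings["compiler.version"]'
print(pattern.findall(single))
# => ['results', 'items', 'packages', 'settings']

# Hypothetical input: the middle quoted word touches no quote,
# so it is matched even though it sits inside the quotes.
multi = 'settings["compiler.cppstd.version"]'
print(pattern.findall(multi))
# => ['settings', 'cppstd']
```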
Upvotes: -1
Reputation: 110675
A word is not within a double-quoted substring if and only if it is followed in the string by an even number of double quotes (assuming the string is properly formatted and therefore contains an even number of double quotes). You can use the following regular expression to match words that are not contained within double-quoted substrings.
[a-z_]+(?=(?:(?:[^\"\n]*\"){2})*[^\"\n]*$)
The regular expression can be broken down as follows (alternatively, hover the cursor over each part of the expression at the link to obtain an explanation of its function).
[a-z_]+ # match one or more of the indicated characters
(?= # begin a positive lookahead
(?: # begin an outer non-capture group
(?: # begin an inner non-capture group
[^\"\n]* # match zero or more characters other than " and \n
\" # match "
){2} # end inner non-capture group and execute twice
)* # end outer non-capture group and execute zero or more times
[^\"\n]* # match zero or more characters other than " and \n
$ # match end of string
) # end positive lookahead
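As a rough check, the expression can be run in Python against both lines from the question. This is only a sketch: the re.MULTILINE flag is my assumption, added so that $ matches at the end of each line, and the \" escapes are written as plain " because they are not needed inside a raw string.

```python
import re

# Same expression as above. re.MULTILINE (an assumption) lets $ match at
# the end of every line, so each line is checked for quote parity on its own.
pattern = re.compile(r'[a-z_]+(?=(?:(?:[^"\n]*"){2})*[^"\n]*$)', re.MULTILINE)

text = (
    'results[0].items[0].packages[0].settings["compiler.version"]\n'
    'results[0].items[0].packages[0].settings.build_type'
)

matches = pattern.findall(text)
print(matches)
# => ['results', 'items', 'packages', 'settings',
#     'results', 'items', 'packages', 'settings', 'build_type']
```

The words compiler and version are skipped because each is followed by exactly one double quote before the end of its line.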
Upvotes: 3
Reputation: 626758
You can match strings between double quotes and then match and capture words optionally followed with dot separated words:
list(filter(None, re.findall(r'"[^"]*"|([a-z_]\w*(?:\.[a-z_]\w*)*)', text, re.ASCII | re.I)))
See the regex demo. Details:
"[^"]*" - a " char, zero or more chars other than " and then a " char
| - or
([a-z_]\w*(?:\.[a-z_]\w*)*) - Group 1: a letter or underscore followed with zero or more word chars and then zero or more sequences of a . and then a letter or underscore followed with zero or more word chars.
See the Python demo:
import re
text = 'results[0].items[0].packages[0].settings["compiler.version"] '
print(list(filter(None, re.findall(r'"[^"]*"|([a-z_]\w*(?:\.[a-z_]\w*)*)', text, re.ASCII | re.I))))
# => ['results', 'items', 'packages', 'settings']
The re.ASCII option is used to make \w match [a-zA-Z0-9_] without accounting for Unicode chars.
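One thing to be aware of when applying this to the second line from the question: because Group 1 also consumes dot-separated words, settings.build_type comes back as a single token. The sketch below shows that, together with a variant that drops the dotted part. The variant and the names dotted and plain are my own and are not part of the answer above.

```python
import re

text = 'results[0].items[0].packages[0].settings.build_type'

# Pattern from the answer: a dotted chain is captured as one token.
dotted = r'"[^"]*"|([a-z_]\w*(?:\.[a-z_]\w*)*)'
dotted_words = list(filter(None, re.findall(dotted, text, re.ASCII | re.I)))
print(dotted_words)
# => ['results', 'items', 'packages', 'settings.build_type']

# Variant (an assumption): without the dotted-chain part, each word is
# captured separately, while quoted substrings are still consumed by the
# first alternative and filtered out as empty strings.
plain = r'"[^"]*"|([a-z_]\w*)'
plain_words = list(filter(None, re.findall(plain, text, re.ASCII | re.I)))
print(plain_words)
# => ['results', 'items', 'packages', 'settings', 'build_type']
```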
Upvotes: 4