Ezio

Reputation: 468

Wrap every word in double quotes, but skip words that are already wrapped in double quotes, using regex

I have a query in which I want to wrap every word in double quotes, except for certain operators and keywords. I also want to skip words that are already wrapped in double quotes.

I can already skip has:, from:, to:, sample, etc., but I cannot work out how to skip words that are already in double quotes.

Could anyone please nudge me in the right direction?

Current regex -

\b(?!\bOR\b)\b(?!\bAND\b)\b(?!\bfrom:\b)\b(?!\bto:\b)\b(?!\bhas:\b)\b(?!\bsample\b)\w+\b

Query -

(@harrys OR from:harrys OR to:harrys OR ("harry's" OR harrys) AND (razor OR razors OR shave OR shaving OR shaved OR shaver OR subscription OR razorhead OR razorheads OR buy OR bought OR buying OR boxers OR cover) AND (has:geo OR has:profile_geo) -styles -prince -markle -meghanmarkle)

Upvotes: 1

Views: 139

Answers (1)

Wiktor Stribiżew

Reputation: 626758

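You can match and capture all your exceptions into Group 1, and match the words you want to quote without capturing them. Then, when replacing, check whether Group 1 participated in the match: if it did, put it back unchanged; otherwise, wrap the whole match in double quotes.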

Here is what I mean:

import re
text = r"""(@harrys OR from:harrys OR to:harrys OR ("harry's" OR harrys) AND (razor OR razors OR shave OR shaving OR shaved OR shaver OR subscription OR razorhead OR razorheads OR buy OR bought OR buying OR boxers OR cover) AND (has:geo OR has:profile_geo) -styles -prince -markle -meghanmarkle)"""
pattern = r'("[^"]*"|\b(?:(?:OR|AND|sample)\b|(?:from|to|has):))|\w+'
# Group 1 holds the exceptions and is returned unchanged; any other match is a bare word and gets quoted
print(re.sub(pattern, lambda m: m.group(1) if m.group(1) else f'"{m.group()}"', text))

Output:

(@"harrys" OR from:"harrys" OR to:"harrys" OR ("harry's" OR "harrys") AND ("razor" OR "razors" OR "shave" OR "shaving" OR "shaved" OR "shaver" OR "subscription" OR "razorhead" OR "razorheads" OR "buy" OR "bought" OR "buying" OR "boxers" OR "cover") AND (has:"geo" OR has:"profile_geo") -"styles" -"prince" -"markle" -"meghanmarkle")

See the Python demo. See also the regex demo (all green matches are kept, all blue matches are enclosed with double quotes).

Regex details:

  • ("[^"]*"|\b(?:(?:OR|AND|sample)\b|(?:from|to|has):)) - Group 1 (this text is kept as is, exceptions):
    • "[^"]*" - ", zero or more chars other than ", and a " char
    • | - or
    • \b - a word boundary
    • (?: - start of the non-capturing group
      • (?:OR|AND|sample)\b - OR, AND, or sample, followed by a word boundary
      • | - or
      • (?:from|to|has): - from, to, or has, followed by a colon
    • ) - end of the non-capturing group
  • | - or
  • \w+ - one or more word chars.
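
If the list of exceptions is likely to change, the same idea can be wrapped in a small helper. This is only a sketch: the function name quote_words and its keywords and prefixes parameters are made up for illustration, and both sequences are assumed to be non-empty. The pattern it builds has the same shape as the one explained above.

```python
import re

def quote_words(query, keywords=("OR", "AND", "sample"), prefixes=("from", "to", "has")):
    """Wrap bare words in double quotes, leaving the exceptions untouched.

    Hypothetical helper; `keywords` and `prefixes` are assumed to be non-empty.
    """
    kw = "|".join(map(re.escape, keywords))
    px = "|".join(map(re.escape, prefixes))
    # Group 1 captures everything that must be kept as is: quoted strings,
    # whole-word keywords, and operator prefixes followed by a colon.
    pattern = re.compile(rf'("[^"]*"|\b(?:(?:{kw})\b|(?:{px}):))|\w+')

    def repl(m):
        # Group 1 participated: this is an exception, return it unchanged.
        if m.group(1) is not None:
            return m.group(1)
        # Otherwise the match is a bare word: wrap it in double quotes.
        return f'"{m.group()}"'

    return pattern.sub(repl, query)

print(quote_words('from:harrys OR ("harry\'s" OR razor) -prince'))
# from:"harrys" OR ("harry's" OR "razor") -"prince"
```

The order of the alternatives matters: the exceptions come first, so at any position the regex engine tries them before falling back to \w+.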

Upvotes: 1
