Agate

Reputation: 3242

Python regex matching pattern not surrounded by double quotes

I'm not comfortable with regex, so I need your help with this one, which seems tricky to me.

Let's say I've got the following string :

string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'

What regex would capture title:hello and title:world, remove them from the original string, and leave "title:quoted" in place because it is surrounded by double quotes?

I've already seen this similar SO answer, and here is what I ended up with:

import re

string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'

def replace(m):
    if m.group(1) is None:
        return m.group()

    return m.group().replace(m.group(1), "")

regex = r'\"[^\"]title:[^\s]+\"|([^\"]*)'
cleaned_string = re.sub(regex, replace, string)

assert cleaned_string == 'keyword1 keyword2 "title:quoted" keyword3'

Of course, it does not work, and I'm not surprised, because regexes are esoteric to me.

Thank you for your help!

Final solution

Thanks to your answers, here is the final solution, which works for my needs:

import re
matches = []

def replace(m):
    matches.append(m.group())
    return ""

string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
regex = r'(?<!")title:[^\s]+(?!")'  # raw string so \s is not treated as a string escape
cleaned_string = re.sub(regex, replace, string)

# remove extra whitespace
cleaned_string = ' '.join(cleaned_string.split())

assert cleaned_string == 'keyword1 keyword2 "title:quoted" keyword3'
assert matches[0] == "title:hello"
assert matches[1] == "title:world"

Upvotes: 3

Views: 4141

Answers (4)

Casimir et Hippolyte

Reputation: 89649

A little violent but works in all situations and without catastrophic backtracking:

import re

string = r'''keyword1 keyword2 title:hello title:world "title:quoted"title:foo
       "abcd \" title:bar"title:foobar keyword3 keywordtitle:keyword
       "non balanced quote title:foobar'''

pattern = re.compile(
    r'''(?:
            (      # other content
                (?:(?=(
                    " (?:(?=([^\\"]+|\\.))\3)* (?:"|$) # quoted content
                  |
                    [^t"]+             # all that is not a "t" or a quote
                  |
                    \Bt                # "t" preceded by word characters
                  |
                    t (?!itle:[a-z]+)  # "t" not followed by "itle:" + letters 
                )  )\2)+
            )
          |     # OR
            (?<!") # not preceded by a double quote
        )
        (?:\btitle:[a-z]+)?''',
    re.VERBOSE)

print(re.sub(pattern, r'\1', string))

Upvotes: 0

Padraic Cunningham

Reputation: 180550

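re.sub(r'[^"]title:\w+', "", string)
keyword1 keyword2 "title:quoted" keyword3

This removes any substring made of one character other than a double quote, then title:, then one or more word characters (\w+). The character before title: (here, a space) is removed along with it.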
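One thing to watch with a character class such as [^"]: it has to consume a real character, so a title: term at the very start of the string is not matched. Here is a minimal sketch, assuming a hypothetical input (text) that starts with such a term, comparing it with a negative lookbehind that checks the same condition without consuming anything:

```python
import re

# Hypothetical input whose first token is a title: term.
text = 'title:first keyword1 title:hello "title:quoted"'

# Character-class version: needs one non-quote character before "title:",
# so the occurrence at position 0 is left alone.
consuming = re.sub(r'[^"]title:\w+', '', text)

# Lookbehind version: only asserts that no double quote precedes "title:".
lookbehind = re.sub(r'(?<!")title:\w+', '', text)

print(consuming)
print(lookbehind)
```

The lookbehind version leaves the surrounding spaces behind, so you may still want to normalise whitespace afterwards, for example with ' '.join(result.split()).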

Upvotes: 1

zx81

Reputation: 41848

This situation sounds very similar to "regex-match a pattern unless..."

We can solve it with a beautifully-simple regex:

"[^"]*"|(\btitle:\S+)

The left side of the alternation | matches complete "double-quoted strings". We will ignore these matches. The right side matches and captures your title:hello strings to Group 1, and we know they are the right ones because they were not matched by the expression on the left.

This program shows how to use the regex (see the results at the bottom of the online demo):

import re
subject = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
regex = re.compile(r'"[^"]*"|(\btitle:\S+)')
def myreplacement(m):
    if m.group(1):
        return ""
    else:
        return m.group(0)
replaced = regex.sub(myreplacement, subject)
print(replaced)
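Since the question also asks to get the removed terms, the same callback can record them as it goes. This is only a sketch: the found list and the collect_and_remove name are made up for illustration, and the final whitespace cleanup is an extra step not in the program above.

```python
import re

subject = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
regex = re.compile(r'"[^"]*"|(\btitle:\S+)')

found = []  # hypothetical accumulator for the unquoted title: terms

def collect_and_remove(m):
    # Group 1 is set only when the right side of the alternation matched.
    if m.group(1):
        found.append(m.group(1))
        return ""
    # Otherwise a quoted string matched: put it back unchanged.
    return m.group(0)

# Collapse the runs of spaces left behind by the removed terms.
cleaned = " ".join(regex.sub(collect_and_remove, subject).split())

print(found)
print(cleaned)
```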

Reference

How to match (or replace) a pattern except in situations s1, s2, s3...

Upvotes: 3

alecxe

Reputation: 474281

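You can check for word boundaries (\b), but note that \b also matches between the opening double quote and title, so on its own it does not protect the quoted occurrence. The call below leaves "title:quoted" in place only because re.I (whose value is 2) is passed as the fourth positional argument, which re.sub treats as count, so only the first two matches are replaced. To actually pass flags, use flags=re.I:

>>> import re
>>> s = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
>>> re.sub(r'\btitle:\w+\b', '', s, re.I)  # fourth positional argument is count, not flags
'keyword1 keyword2   "title:quoted" keyword3'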

Alternatively, you can use negative lookbehind and lookahead assertions to check that title:\w+ is not surrounded by quotes:

>>> re.sub(r'(?<!")title:\w+(?!")', '', s)
'keyword1 keyword2   "title:quoted" keyword3'
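The lookaround assertions only inspect the characters immediately next to the match. As a rough illustration, assuming a hypothetical input where the quoted part is a multi-word phrase, a title: term inside the quotes is still removed, whereas a pattern that first consumes whole quoted strings leaves it alone (keep_quoted is a made-up helper name):

```python
import re

# Hypothetical input: the quoted phrase contains a title: term in the middle.
text = 'keyword1 title:hello "some title:inside phrase" keyword2'

# Lookarounds: "title:inside" has spaces on both sides, so it is removed too.
lookaround = re.sub(r'(?<!")title:\w+(?!")', '', text)

def keep_quoted(m):
    # Drop unquoted title: terms (group 1); keep quoted strings unchanged.
    return "" if m.group(1) else m.group(0)

# Alternation: quoted strings are matched first and put back as they were.
alternation = re.sub(r'"[^"]*"|(\btitle:\w+)', keep_quoted, text)

print(lookaround)
print(alternation)
```

If the quoted values are always a single title:word token, as in the question, the lookaround version is enough.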

Upvotes: 6
