Reputation: 3242
I'm not comfortable with regex, so I need your help with this one, which seems tricky to me.
Let's say I've got the following string :
string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
What regex would extract title:hello and title:world, remove them from the original string, and leave "title:quoted" in place because it is surrounded by double quotes?
I've already seen this similar SO answer, and here is what I ended up with :
import re
string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
def replace(m):
    if m.group(1) is None:
        return m.group()
    return m.group().replace(m.group(1), "")
regex = r'\"[^\"]title:[^\s]+\"|([^\"]*)'
cleaned_string = re.sub(regex, replace, string)
assert cleaned_string == 'keyword1 keyword2 "title:quoted" keyword3'
Of course, it does not work, and I'm not surprised, because regexes are esoteric to me.
Thank you for your help !
Thanks to your answers, here is the final solution, which works for my needs:
import re
matches = []
def replace(m):
    matches.append(m.group())
    return ""
string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
regex = r'(?<!")title:[^\s]+(?!")'
cleaned_string = re.sub(regex, replace, string)
# remove extra whitespace
cleaned_string = ' '.join(cleaned_string.split())
assert cleaned_string == 'keyword1 keyword2 "title:quoted" keyword3'
assert matches[0] == "title:hello"
assert matches[1] == "title:world"
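The same idea can be packaged without a module-level list. The following is only a sketch under the assumption that the pattern from the solution above is kept unchanged; the function name `split_titles` is made up for illustration.

```python
import re

# Same pattern as in the solution above, compiled once as a raw string.
TITLE = re.compile(r'(?<!")title:[^\s]+(?!")')


def split_titles(text):
    """Return (cleaned_text, titles) for the given search string.

    `split_titles` is a hypothetical helper name, not part of the original post.
    """
    # The pattern has no capture groups, so findall returns the full matches.
    titles = TITLE.findall(text)
    # Remove the same matches, then collapse the leftover runs of whitespace.
    cleaned = " ".join(TITLE.sub("", text).split())
    return cleaned, titles


if __name__ == "__main__":
    print(split_titles('keyword1 keyword2 title:hello title:world "title:quoted" keyword3'))
```

Because `findall` and `sub` scan the string the same way, both calls see the same matches; the cost is a second pass over the text.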
Upvotes: 3
Views: 4141
Reputation: 89649
A little violent but works in all situations and without catastrophic backtracking:
import re
string = r'''keyword1 keyword2 title:hello title:world "title:quoted"title:foo
"abcd \" title:bar"title:foobar keyword3 keywordtitle:keyword
"non balanced quote title:foobar'''
pattern = re.compile(
r'''(?:
( # other content
(?:(?=(
" (?:(?=([^\\"]+|\\.))\3)* (?:"|$) # quoted content
|
[^t"]+ # all that is not a "t" or a quote
|
\Bt # "t" preceded by word characters
|
t (?!itle:[a-z]+) # "t" not followed by "itle:" + letters
) )\2)+
)
| # OR
(?<!") # not preceded by a double quote
)
(?:\btitle:[a-z]+)?''',
re.VERBOSE)
print(re.sub(pattern, r'\1', string))
Upvotes: 0
Reputation: 180550
re.sub(r'[^"]title:\w+', "", string)
keyword1 keyword2 "title:quoted" keyword3
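Replace any substring starting with title: followed by one or more word characters (\w+), as long as the character before it is not a double quote. The [^"] also consumes that preceding character, so the space before each match is removed along with it.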
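One consequence of the character class is that it needs a character to consume, so a title: token at the very start of the string is not matched. A rough comparison with a negative lookbehind, which consumes nothing, might look like this (the helper names and the sample string are invented for the illustration):

```python
import re

# Character class: consumes the character before "title:".
CONSUMING = re.compile(r'[^"]title:\w+')
# Negative lookbehind: only checks the previous character, consumes nothing.
LOOKBEHIND = re.compile(r'(?<!")title:\w+')


def drop_consuming(text):
    """Remove matches of the character-class pattern (hypothetical helper)."""
    return CONSUMING.sub("", text)


def drop_lookbehind(text):
    """Remove matches of the lookbehind pattern (hypothetical helper)."""
    return LOOKBEHIND.sub("", text)


if __name__ == "__main__":
    sample = 'title:first keyword "title:quoted" title:last'
    # The leading token survives here because nothing precedes it.
    print(repr(drop_consuming(sample)))
    # The leading token is removed here; the surrounding spaces are kept.
    print(repr(drop_lookbehind(sample)))
```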
Upvotes: 1
Reputation: 41848
This situation sounds very similar to "regex-match a pattern unless..."
We can solve it with a beautifully-simple regex:
"[^"]*"|(\btitle:\S+)
The left side of the alternation | matches complete "double-quoted strings". We will ignore these matches. The right side matches and captures your title:hello strings to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
This program shows how to use the regex (see the results at the bottom of the online demo):
import re
subject = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
regex = re.compile(r'"[^"]*"|(\btitle:\S+)')
def myreplacement(m):
    if m.group(1):
        return ""
    else:
        return m.group(0)
replaced = regex.sub(myreplacement, subject)
print(replaced)
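If the removed tokens are also needed, the replacement callback can record Group 1 before discarding it. This is a sketch only: the function name `remove_unquoted_titles` and the whitespace clean-up step are assumptions added here, not part of the original program.

```python
import re

# Left branch: a complete double-quoted string. Right branch: a bare title token.
PATTERN = re.compile(r'"[^"]*"|(\btitle:\S+)')


def remove_unquoted_titles(text):
    """Return (cleaned_text, removed_titles); hypothetical helper name."""
    removed = []

    def _replace(m):
        if m.group(1) is None:
            # The quoted branch matched: put the text back untouched.
            return m.group(0)
        removed.append(m.group(1))
        return ""

    cleaned = PATTERN.sub(_replace, text)
    # Assumption: collapsing whitespace is acceptable, including inside quotes.
    return " ".join(cleaned.split()), removed


if __name__ == "__main__":
    print(remove_unquoted_titles('keyword1 "a title:inside phrase" title:hello keyword3'))
```

Since the quoted branch consumes everything up to the closing quote, a title: token in the middle of a multi-word quoted phrase is left alone as well.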
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Upvotes: 3
Reputation: 474281
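You can check for word boundaries (\b), but be aware that \b also matches between the opening quote and title, so it does not protect the quoted token by itself. In the call below, re.I lands in the count parameter (the fourth positional argument of re.sub), and since re.I equals 2, only the first two matches are replaced:
>>> import re
>>> s = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
>>> re.sub(r'\btitle:\w+\b', '', s, re.I)
'keyword1 keyword2   "title:quoted" keyword3'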
Alternatively, you can use negative lookbehind and lookahead assertions to check that title:\w+ has no quotes around it:
>>> re.sub(r'(?<!")title:\w+(?!")', '', s)
'keyword1 keyword2   "title:quoted" keyword3'
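The lookaround version only inspects the characters directly next to the match, so it works when the quotes hug the token, as in the question's example. A quick check with a multi-word quoted phrase (a made-up input, not from the question) suggests where it stops being enough:

```python
import re

LOOKAROUND = re.compile(r'(?<!")title:\w+(?!")')


def strip_with_lookarounds(text):
    """Remove title tokens not directly touching a quote (hypothetical helper)."""
    return LOOKAROUND.sub("", text)


if __name__ == "__main__":
    # Quotes directly around the token: the quoted token is kept.
    print(repr(strip_with_lookarounds('"title:quoted" title:hello')))
    # Token in the middle of a quoted phrase: it is removed as well,
    # because neither neighbouring character is a quote.
    print(repr(strip_with_lookarounds('"my title:quoted phrase" title:hello')))
```

If quoted phrases can contain several words, matching the whole quoted string first and skipping it is the safer route.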
Upvotes: 6