Reputation: 347
I am interested in removing all occurrences of a pattern in a Python string where the pattern looks like "start-string blah, blah, blah end-string". This is a general problem I'd like to be able to handle. It is the same problem as "How can I remove a portion of text from a string whenever it starts with &*( and ends with )(*", but in Python rather than Java.
How would I solve the same problem in Python?
Assume the string looks like this,
'Bla bla bla <mark asd asd asd /> bla bla bla. Yadda yadda yadda <mark akls lkja /> yadda.'
The start of the block to remove is <mark and the end is />. So I do the following:
import re
mystring = "Bla bla bla <mark asd asd asd /> bla bla bla. Yadda yadda yadda <mark akls lkja /> yadda."
tags = "<mark", "/>"
re.sub('%s.*%s' % tags, '', mystring)
My desired output is
'Bla bla bla bla bla bla. Yadda yadda yadda yadda.'
But what I get is
'Bla bla bla yadda.'
So clearly the command is using the first instance of the opening string and the last occurrence of the end string.
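To check that reading, here is a minimal sketch that lists what the pattern actually matches. The name greedy_matches is only illustrative; re.findall is the standard-library call.

```python
import re

mystring = "Bla bla bla <mark asd asd asd /> bla bla bla. Yadda yadda yadda <mark akls lkja /> yadda."

# With a greedy .* the match starts at the first "<mark" and runs
# to the LAST "/>", so the whole middle of the string is one match.
greedy_matches = re.findall(r"<mark.*/>", mystring)

print(len(greedy_matches))  # number of matches found
print(greedy_matches[0])    # the single, over-long match
```

Only one match comes back, and it spans both tags plus the text between them.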
How do I make it match the pattern twice and give me the desired output? This has to be easy, but despite searching for "remove multiple occurrences regex Python" and the like, I have not found an answer. Thanks.
Upvotes: 3
Views: 1330
Reputation: 117856
You basically want to find anything between '<mark' and '/>', so you start with the pattern r'<mark .* />'. However, the .* is greedy: it matches as much as it can, up to the last '/>'. To make it non-greedy, add a ? after it, then simply use re.sub to replace those matches with an empty string:
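>>> # mystring is the string from the question; \s* also consumes the space left after each tag
>>> re.sub(r'<mark .*? />\s*', '', mystring)
'Bla bla bla bla bla bla. Yadda yadda yadda yadda.'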
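Since the question asks about the general case, including delimiters such as &*( and )(* that contain regex metacharacters, here is a hedged sketch of a reusable version. The helper name remove_between is hypothetical and not part of the re module. It assumes the delimiters should be matched literally and that a block may span several lines.

```python
import re

def remove_between(text, start, end):
    """Remove every block that begins with `start` and ends with `end`.

    Hypothetical helper for illustration, not a library function.
    """
    # re.escape makes delimiters like "&*(" safe to embed in a pattern.
    # .*? is non-greedy, so each block stops at the nearest `end`.
    pattern = re.escape(start) + r".*?" + re.escape(end)
    # re.DOTALL lets "." match newlines, so blocks may span lines.
    return re.sub(pattern, "", text, flags=re.DOTALL)

sample = "keep &*( drop this )(* keep &*( and this )(* end"
print(remove_between(sample, "&*(", ")(*"))
```

Note that this version leaves the whitespace on both sides of each removed block in place, so a follow-up cleanup (or a trailing \s* in the pattern) may be wanted depending on the input.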
Upvotes: 3