xbot

Reputation: 347

How do I remove multiple occurrences of a pattern from a string in Python?

I want to remove all occurrences of a pattern from a Python string, where the pattern looks like "start-string blah, blah, blah end-string". This is a general problem I'd like to be able to handle. It is the same problem as "How can I remove a portion of text from a string whenever it starts with &*( and ends with )(*", but in Python rather than Java.

How would I solve the same problem in Python?

Assume the string looks like this:

'Bla bla bla <mark asd asd asd /> bla bla bla. Yadda yadda yadda <mark akls lkja /> yadda.'

The start of the block to remove is <mark and the end is />. So I do the following:

import re
mystring = "Bla bla bla <mark asd asd asd /> bla bla bla. Yadda yadda yadda <mark akls lkja /> yadda."
tags = "<mark", "/>"
re.sub('%s.*%s' % tags, '', mystring)

My desired output is

'Bla bla bla  bla bla bla. Yadda yadda yadda  yadda.'

But what I get is

'Bla bla bla  yadda.'

So clearly the command is matching from the first occurrence of the opening string to the last occurrence of the end string.

How do I make it match the pattern twice and give me the desired output? This has to be easy but despite searches on "remove multiple occurrences regex Python" and the like I have not found an answer. Thanks.

Upvotes: 3

Views: 1330

Answers (1)

Cory Kramer

Reputation: 117856

You basically want to find anything between '<mark' and '/>' so you start with the pattern

r'<mark .* />'

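However, .* is greedy: it matches as much as it can, so a single match runs from the first '<mark' to the last '/>'. Adding a ? makes it non-greedy (.*?), so each match stops at the nearest '/>'. Then use re.sub to replace every match with an empty string: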

>>> re.sub(r'<mark .*? />', '', mystring)
'Bla bla bla  bla bla bla. Yadda yadda yadda  yadda.'
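
Since the question asks for something general, here is a minimal sketch that wraps the same non-greedy idea in a helper. The name remove_blocks is hypothetical, not something from the re module. It assumes the delimiters should be treated as literal text, so it passes them through re.escape, and it assumes a block may span several lines, hence re.DOTALL. Drop that flag if blocks never cross a line break.

```python
import re

def remove_blocks(text, start, end):
    """Remove every span that begins with `start` and ends at the nearest `end`."""
    # re.escape keeps characters such as ( or * in the delimiters literal.
    # .*? is non-greedy, so each match stops at the first `end` it reaches.
    # re.DOTALL lets . match newlines (an assumption: blocks may span lines).
    pattern = re.escape(start) + r".*?" + re.escape(end)
    return re.sub(pattern, "", text, flags=re.DOTALL)

mystring = "Bla bla bla <mark asd asd asd /> bla bla bla. Yadda yadda yadda <mark akls lkja /> yadda."
result = remove_blocks(mystring, "<mark", "/>")
print(result)
```

Escaping matters for delimiters like the &*( and )(* from the Java question, which would otherwise be read as regex syntax. Note that the space on each side of a removed block is kept, which is why the output contains double spaces.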

Upvotes: 3
