Avoiding duplicate deletion in python regex

Question

I am removing lines from a number of txt files using a regex in Python. However, I came across a case where it sometimes deletes a second line as well, if that line contains a string very similar to the one I am targeting:

s = 'market.fruit.apple'

The txt file might contain the following lines

market.fruit.apple
market.fruit.apple.all

But if I run

open_file = open('test.txt', 'r')
read_file = open_file.read()
r = re.compile(r"(?



it removes both market.fruit.apple and market.fruit.apple.all when only the first one should be removed. How do I avoid this? I tried setting the count parameter to 1, but that made no difference. I was thinking of computing a string similarity between the strings and using a different regex when they match the right condition, but I figured this might be unnecessary overhead if I scale this up.

Edit: Corrected some typos in the example above; it can be reproduced at regex101.com/r/q7qWVh/1

Wiktor Stribiżew · Accepted Answer

You may use

r"(?<!\S){}[\s=].*(?:[\r\n]\s*)?".format(re.escape(s))



Note the use of re.escape: it is necessary because you are inserting a variable that represents literal text into the regex pattern.
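To illustrate why the escaping matters, here is a minimal sketch. The second test string is made up for the demonstration and does not come from the question.

```python
import re

s = 'market.fruit.apple'

# re.escape backslash-escapes regex metacharacters,
# so each "." can only match a literal dot.
escaped = re.escape(s)
print(escaped)  # market\.fruit\.apple

# Without escaping, "." matches any character, so an unrelated
# string (made up for this demo) is accepted as well.
unescaped_hit = re.fullmatch(s, 'marketXfruit-apple') is not None
escaped_hit = re.fullmatch(escaped, 'marketXfruit-apple') is not None
print(unescaped_hit, escaped_hit)  # True False
```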

If your variable is market.fruit.apple, your regex will look like

(?<!\S)market\.fruit\.apple[\s=].*(?:[\r\n]\s*)?


See the regex demo

Details


(?<!\S) - a left-hand whitespace boundary
market\.fruit\.apple - the keyword
[\s=] - a whitespace or = char
.* - any 0 or more chars other than line break chars, as many as possible
(?:[\r\n]\s*)? - an optional sequence of a CR or LF line break char and then any 0 or more whitespaces.
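
Putting the pieces above together, a rough end-to-end sketch could look like the following. The helper name remove_key_lines and the sample data are assumptions for illustration. In particular, the sample assumes that each key is followed by " = value" on the same line.

```python
import re

def remove_key_lines(text, key):
    """Delete each line that starts with exactly `key` (hypothetical helper)."""
    # Whitespace boundary, escaped key, a whitespace or "=", the rest of
    # the line, then an optional line break plus trailing whitespace.
    pattern = r"(?<!\S){}[\s=].*(?:[\r\n]\s*)?".format(re.escape(key))
    return re.sub(pattern, "", text)

# Assumed sample data: every key is followed by " = value".
sample = (
    "market.fruit.apple = 1\n"
    "market.fruit.apple.all = 2\n"
    "market.fruit.pear = 3\n"
)

result = remove_key_lines(sample, "market.fruit.apple")
print(result)
```

Only the first line is removed here, because in market.fruit.apple.all the keyword is followed by a dot, which [\s=] rejects. Note that \s in [\s=] also matches a line break, so if a bare key sits at the end of a line, .* would go on to consume the following line; the sample sidesteps this by putting a value after each key.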
