SooWoo

Reputation: 71

Avoiding duplicate deletion in python regex

I am removing lines from a bunch of txt files with a regex in Python. However, I came across a case where it also deletes a second line if that line contains a string fairly similar to the first one.

s = 'market.fruit.apple'

The txt file might contain the following lines:

market.fruit.apple
market.fruit.apple.all

But if I run

import re

with open('test.txt', 'r') as open_file:
    read_file = open_file.read()
r = re.compile(r"(?<!\S){0}.*(?:[\r\n]\s*)?".format(s))
read_file = r.sub('', read_file)
with open('test.txt', 'w') as write_file:
    write_file.write(read_file)

it removes both market.fruit.apple and market.fruit.apple.all when only the first one should be removed. How do I avoid this? I tried setting the count parameter to 1, but that didn't do anything. I was thinking of computing a string similarity between the strings and using a different regex if it matches the right condition, but I figured this might be unnecessary overhead if I scale this up.

Edit: Corrected some typos in the example above; the issue can be reproduced at regex101.com/r/q7qWVh/1

Upvotes: 1

Views: 69

Answers (2)

Wiktor Stribiżew

Reputation: 627335

You may use

r"(?<!\S){0}[\s=].*(?:[\r\n]\s*)?".format(re.escape(s))

Note the use of re.escape: it is necessary because you are interpolating a variable that represents literal text into the regex pattern.

If your variable is market.fruit.apple, your regex will look like

(?<!\S)market\.fruit\.apple[\s=].*(?:[\r\n]\s*)?

See the regex demo

Details

  • (?<!\S) - a left-hand whitespace boundary
  • market\.fruit\.apple - the keyword
  • [\s=] - a whitespace or = char
  • .* - any 0 or more chars other than line break chars, as many as possible
  • (?:[\r\n]\s*)? - an optional sequence of a CR or LF line break char and then any 0 or more whitespaces.
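
As a rough sketch of how this pattern behaves, the snippet below applies it to a small in-memory sample. The sample text (lines of the form key = value) and the variable names are assumptions for illustration, not taken from the question's actual files.

```python
import re

s = 'market.fruit.apple'

# Hypothetical file content: each key is followed by " = value".
text = (
    "market.fruit.apple = 1\n"
    "market.fruit.apple.all = 2\n"
    "other.key = 3\n"
)

# re.escape turns the dots into literal dots; [\s=] requires a whitespace
# or "=" right after the keyword, so "market.fruit.apple.all" is skipped.
pattern = re.compile(r"(?<!\S){0}[\s=].*(?:[\r\n]\s*)?".format(re.escape(s)))
result = pattern.sub('', text)

print(result)
```

Keep in mind that \s also matches a line break, so on a line that holds only the bare keyword, [\s=] would consume the newline and .* would continue into the next line.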

Upvotes: 1

Abdelrahman Abounegm

Reputation: 715

There are a couple of problems with this RegEx. First, the dot in your string is interpreted as the "any single character" token, not a literal dot. It needs to be escaped with a backslash: \.. Next, the non-capturing group at the end that matches whitespace is optional, and the .* before it will keep matching characters until it reaches a line break, so anything that merely starts with your string (such as market.fruit.apple.all) is matched as well. I also don't understand the purpose of the negative lookbehind at the start.
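
To make these two points concrete, here is a small sketch that reproduces the behaviour on an in-memory string. The sample text mirrors the two lines from the question; the names text, unescaped, escaped, and original are only illustrative.

```python
import re

s = 'market.fruit.apple'
text = "market.fruit.apple\nmarket.fruit.apple.all\n"

# Unescaped, each "." matches any character, not just a literal dot.
unescaped = re.compile(s)
print(bool(unescaped.fullmatch('marketXfruitXapple')))  # True

# With re.escape, only literal dots are accepted.
escaped = re.compile(re.escape(s))
print(bool(escaped.fullmatch('marketXfruitXapple')))  # False

# The pattern from the question: .* swallows ".all" on the second line,
# so both lines match and are removed.
original = re.compile(r"(?<!\S){0}.*(?:[\r\n]\s*)?".format(s))
removed = original.sub('', text)
print(repr(removed))  # ''
```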

As to how to fix it, here is my suggestion:

1- If you need to compare the line with a string literal and you are not using any of the RegEx features, you can just read the lines and filter them as such:

lines = open_file.readlines()
# readlines() keeps the trailing newline, so strip it before comparing
lines = [line for line in lines if line.rstrip('\r\n') != s]

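2- If you need it in Regex, you can simply replace the lookbehind at the start and the non-capturing group at the end with ^ and $, signifying start and end of line, respectively. The new RegEx will be ^market\.fruit\.apple$ (in Python, pass re.MULTILINE so that ^ and $ match at every line, not just at the start and end of the whole text), and you can see it in action here: https://regex101.com/r/pi7Wjw/1/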
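
Both options can be sketched on an in-memory string as follows. This is only an illustration under a few assumptions: the sample text is made up, the trailing \n? in the pattern is an addition so that the removed line does not leave an empty line behind, and re.MULTILINE is passed so that ^ and $ match at each line rather than only at the ends of the whole string.

```python
import re

s = 'market.fruit.apple'
text = "market.fruit.apple\nmarket.fruit.apple.all\n"

# Option 1: plain string comparison, line by line.
# The newline is stripped for the comparison only; kept lines are unchanged.
lines = text.splitlines(keepends=True)
kept = [line for line in lines if line.rstrip('\r\n') != s]
filtered = ''.join(kept)

# Option 2: anchored regex. ^ and $ pin the match to a whole line,
# and re.escape makes the dots literal.
pattern = re.compile(r"^{0}$\n?".format(re.escape(s)), re.MULTILINE)
substituted = pattern.sub('', text)

print(repr(filtered))
print(repr(substituted))
```

In both cases only the exact line market.fruit.apple is dropped and market.fruit.apple.all is kept.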

Make sure to also check the re library documentation for more info about how to use the various special symbols.

Upvotes: 0
