Reputation: 71
I am removing lines from a number of txt files with a regex in Python. However, I came across a case where it deletes an extra line when that line contains a string that starts the same way as the one I am targeting
s = 'market.fruit.apple'
The txt file might contain the following lines
market.fruit.apple
market.fruit.apple.all
But if I run
import re

with open('test.txt', 'r') as open_file:
    read_file = open_file.read()
r = re.compile(r"(?<!\S){0}.*(?:[\r\n]\s*)?".format(s))
read_file = r.sub('', read_file)
with open('test.txt', 'w') as write_file:
    write_file.write(read_file)
it removes both market.fruit.apple and market.fruit.apple.all when only the first one should be removed. How do I avoid this? I tried setting the count parameter to 1, but that didn't change anything. I was thinking of computing a string similarity between the strings and using a different regex when it meets the right condition, but I figured this might be unnecessary overhead if I scale this up.
Edit: Corrected some typos in the example above; the issue can be reproduced at regex101.com/r/q7qWVh/1
Upvotes: 1
Views: 69
Reputation: 627335
You may use
r"(?<!\S){0}[\s=].*(?:[\r\n]\s*)?".format(re.escape(s))
Note the use of re.escape: it is necessary because you are inserting a variable that represents literal text into the regex pattern.
If your variable is market.fruit.apple, your regex will look like
(?<!\S)market\.fruit\.apple[\s=].*(?:[\r\n]\s*)?
See the regex demo
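As a minimal sketch of how this pattern could be applied, the snippet below wraps it in a helper and runs it on an in-memory string. The function name remove_key_lines and the sample text are made up for illustration, and the sample assumes each key is followed by a space or = on the same line, which is what [\s=] expects.

```python
import re

def remove_key_lines(text, key):
    # Hypothetical helper: escape the key so its dots match literal dots only.
    pattern = re.compile(
        r"(?<!\S){0}[\s=].*(?:[\r\n]\s*)?".format(re.escape(key))
    )
    return pattern.sub("", text)

# Assumed sample data: every key is followed by " = value" on its own line.
sample = "market.fruit.apple = 1\nmarket.fruit.apple.all = 2\n"
cleaned = remove_key_lines(sample, "market.fruit.apple")
print(cleaned)
```

One caveat with this sketch: if the key sits alone on a line, [\s=] matches the line break itself and .* then consumes the following line, so this form fits best when something follows the key on the same line.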
Details
(?<!\S) - a left-hand whitespace boundary
market\.fruit\.apple - the keyword
[\s=] - a whitespace or = char
.* - any 0 or more chars other than line break chars, as many as possible
(?:[\r\n]\s*)? - an optional sequence of a CR or LF line break char and then any 0 or more whitespaces.
Upvotes: 1
Reputation: 715
There are a couple of problems with this regex. First, the dot in your string is interpreted as the "any single character" token, not a literal dot. It needs to be escaped with a backslash: \. Next, the non-capturing group at the end that matches whitespace is optional, and the .* before it will keep matching characters until it reaches a line break, so anything that follows the keyword on the same line (such as .all) is consumed as well. I also don't understand the purpose of the negative lookbehind at the start.
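To make both points concrete, here is a small sketch using only the standard re module. The test strings are invented for illustration.

```python
import re

s = "market.fruit.apple"

# Unescaped, each "." matches any single character, so this unrelated string matches.
unescaped_hit = re.fullmatch(s, "marketXfruitXapple") is not None

# With re.escape, the dots only match literal dots, so the same string is rejected.
escaped_hit = re.fullmatch(re.escape(s), "marketXfruitXapple") is not None

# A trailing ".*" happily consumes the rest of the line, including ".all".
greedy = re.match(re.escape(s) + r".*", "market.fruit.apple.all").group(0)

print(unescaped_hit, escaped_hit, greedy)
```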
As to how to fix it, here is my suggestion:
1- If you need to compare the line with a string literal and you are not using any regex features, you can just read the lines and filter them like this:
lines = open_file.readlines()
# readlines() keeps the trailing line break, so strip it before comparing
lines = [line for line in lines if line.rstrip('\r\n') != s]
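2- If you need it in regex, you can simply replace the lookbehind and the non-capturing group with ^ and $, signifying the start and end of a line, respectively. The new regex will be ^market\.fruit\.apple$, and you can see it in action here: https://regex101.com/r/pi7Wjw/1/ In Python, compile it with the re.MULTILINE flag so that ^ and $ match at every line rather than only at the start and end of the whole string.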
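Both suggestions can be sketched roughly as follows. The function names and the sample text are hypothetical, and the sketch assumes the text uses \n line endings (which is what Python gives you when a file is read in text mode).

```python
import re

def drop_exact_lines(text, key):
    # Option 1: plain comparison, ignoring each line's trailing line break.
    kept = [line for line in text.splitlines(keepends=True)
            if line.rstrip("\r\n") != key]
    return "".join(kept)

def drop_exact_lines_regex(text, key):
    # Option 2: anchored regex. re.MULTILINE lets ^ and $ match at every line,
    # and the optional \n removes the line break along with the line.
    pattern = re.compile(r"^{0}$\n?".format(re.escape(key)), re.MULTILINE)
    return pattern.sub("", text)

sample = "market.fruit.apple\nmarket.fruit.apple.all\n"
by_filter = drop_exact_lines(sample, "market.fruit.apple")
by_regex = drop_exact_lines_regex(sample, "market.fruit.apple")
print(by_filter == by_regex)
```

The line filter avoids regex entirely and is probably the simpler choice when the whole line must equal the key; the regex form is handy if the match later needs to become more flexible.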
Make sure to also check the re library documentation for more info about how to use the various special symbols.
Upvotes: 0