Reputation: 83
Issue
I'm not very familiar with regex yet, and I'm trying to solve a problem with automatically processing text. The real problem is slightly more complex than the example below, but I've simplified it as much as possible, since the difficulty lies with my regex abilities.
Say we have a string that contains two different patterns, in this case AA and BB, at random locations in the string. Each pattern can appear zero or more times, in any order.
For example:
"Hello, this AAis just a BB test string. I'm AA here to test BB the regex."
What I want to do is search and replace the word "test" with the word "fix" based on the following two rules:
Example:
So in the above example the word "test" shows up twice.
The first part is: "Hello, this AAis just a BB test"
Rule number 3 applies and passes. Both patterns are found before the "test" and it ends with BB.
The second part is: "Hello, this AAis just a BB test string. I'm AA here to test"
Here rule number 3 applies but does not pass.
The final result being:
"Hello, this AAis just a BB fix string. I'm AA here to test BB the regex."
Different solution:
Now, there are other ways to achieve this. For example, I could count how many times "test" appears in the string and loop over it, keeping track of which pattern (if any) came last before each "test" and acting on that. Repeating this until every "test" has been handled feels really inefficient, though.
My attempt at a regex solution
Initially, my issue was that everything was greedy. So [AA]*.*[BB]*.[^AA]+test matched everything up to the last "test" in the string, when I just wanted the match up to the first "test", and then to iterate until I got to the last one.
So, I modified it to: [AA]*?.*[BB]+?[^AA]*?test?
Based on the regex documentation, appending a ? to a quantifier makes it non-greedy.
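To make the greedy versus non-greedy difference concrete, here is a minimal sketch on a made-up sample string (an assumption for illustration, not the real data). It also shows that square brackets define a character class, so [AA] matches a single "A" rather than the two-character sequence "AA".

```python
import re

text = "x BB test y AA test"  # made-up sample string

# Greedy: .* takes as much as it can, so the match runs to the LAST "test".
greedy = re.search(r".*test", text).group()

# Non-greedy: .*? takes as little as it can, so the match stops at the FIRST "test".
lazy = re.search(r".*?test", text).group()

print(greedy)  # x BB test y AA test
print(lazy)    # x BB test

# [AA] is a character class (same as [A]): it matches one character at a time.
single = re.findall(r"[AA]", "AA")   # ['A', 'A']
# Without brackets, AA matches the literal two-character sequence.
literal = re.findall(r"AA", "AA")    # ['AA']
```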
This is almost what I want: rules 2 and 3 are covered, but it won't work for rule 1, and I'm not sure how to fix the pattern.
Also, how would I iterate my regex pattern over the entire string AND use re.sub to replace the words when needed?
Any help is greatly appreciated.
Upvotes: 0
Views: 207
Reputation: 1496
I don't think trying to build a single regular expression to do everything is going to be a fruitful approach. Instead, let's use multiple regular expressions and a little programming to solve the problem:
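import re

# Placeholder patterns; substitute the real ones.
AA = "AA"
BB = "BB"

def replace_test(string):
    # Record where each pattern starts, tagged with its type.
    aa_locs = [(m.start(), "aa") for m in re.finditer(AA, string)]
    bb_locs = [(m.start(), "bb") for m in re.finditer(BB, string)]
    # The (len(string), "end") sentinel ensures the tail of the string is copied.
    merged = sorted(aa_locs + bb_locs + [(len(string), "end")])
    start = 0
    result = ""
    replacing = False
    for end, pattern_type in merged:
        if replacing:
            result += string[start:end].replace("test", "fix")
        else:
            result += string[start:end]
        # Replace only while the most recent pattern seen is BB.
        replacing = pattern_type == "bb"
        start = end
    return result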
It is a bit complex and could probably be cleaned up, but let me explain what this code does. First, we want to make a list of every point where the state can change, in order to break the string into regions where we replace the word "test" and regions where we do not. We get a list of each time AA is found and a list of each time BB is found. We store these as tuples (index, pattern). That way we know where there is a possible state change. After that, I merged these into a single sorted list. I also added a sentinel value, which we need to make sure we actually copy the whole string later on.
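As a rough illustration of what that merged list looks like, here is a small sketch run on the example string from the question. It assumes AA and BB are the literal strings "AA" and "BB", and state_changes is a hypothetical helper name used only for this example.

```python
import re

AA = "AA"  # assumption: literal stand-ins for the real patterns
BB = "BB"

def state_changes(string):
    # Hypothetical helper: positions where the replace/no-replace state may change.
    aa_locs = [(m.start(), "aa") for m in re.finditer(AA, string)]
    bb_locs = [(m.start(), "bb") for m in re.finditer(BB, string)]
    # The sentinel marks the end of the string so the last region is not lost.
    return sorted(aa_locs + bb_locs + [(len(string), "end")])

example = "Hello, this AAis just a BB test string. I'm AA here to test BB the regex."
events = state_changes(example)
print(events)
```

Each tuple sorts by its index first, so the list comes out in the order the patterns appear in the string, with the sentinel last.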
We know the initial state is not to replace, and we start at the beginning of the string. In each iteration we take a portion of the string and add it to the result, replacing "test" in that portion only if the current state says so. After doing this, we update the state based on which pattern we have just reached, "aa" or "bb".
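Since the question also asks about re.sub, the same state-tracking idea can be sketched as a single re.sub call with a callback. This is an alternative to the region-based approach, not the same code. It assumes literal patterns "AA" and "BB", assumes that "test" is replaced only when the most recent pattern before it is BB (which is what the worked example shows), and leaves "test" unchanged when no pattern precedes it. The name replace_test_single_pass and its parameters are hypothetical.

```python
import re

def replace_test_single_pass(string, aa="AA", bb="BB", word="test", new="fix"):
    # Hypothetical helper; aa, bb, word and new are illustrative defaults.
    last_seen = None  # name of the most recent marker: "aa", "bb", or None

    def on_match(m):
        nonlocal last_seen
        if m.lastgroup == "word":
            # Replace only if the last marker seen before this word was BB.
            return new if last_seen == "bb" else m.group()
        # Otherwise we hit a marker: remember it and leave the text as is.
        last_seen = m.lastgroup
        return m.group()

    # One alternation with named groups; re.sub scans left to right,
    # so the callback sees markers and words in document order.
    pattern = re.compile(
        f"(?P<aa>{re.escape(aa)})|(?P<bb>{re.escape(bb)})|(?P<word>{re.escape(word)})"
    )
    return pattern.sub(on_match, string)

example = "Hello, this AAis just a BB test string. I'm AA here to test BB the regex."
print(replace_test_single_pass(example))
# Hello, this AAis just a BB fix string. I'm AA here to test BB the regex.
```

The callback keeps the state in a closure variable, so the string is scanned only once and no list of positions has to be built first.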
Upvotes: 1