Nectar

Reputation: 83

Specific regex pattern matching Python

Issue
I'm not too familiar with regex yet, but I'm trying to solve a problem with automatically processing text. The real problem is slightly more complex than the example below; I've simplified it as much as possible, since the difficulty lies with my regex abilities.

Say we have a string that contains two different types of patterns, in this case AA and BB, at random locations in the string. Each pattern can be present zero or more times, in completely random order.

For example:
"Hello, this AAis just a BB test string. I'm AA here to test BB the regex."

What I want to do is search and replace the word "test" with the word "fix" based on the following rules:

  1. If only AA patterns are found before "test", then "test" won't be replaced.
  2. If only BB patterns are found before "test", then "test" is replaced with "fix".
  3. If one or more AA and one or more BB patterns are found before "test", then "test" is replaced with "fix" only if the last of those patterns is BB.
  4. If none of the patterns are found then "test" is always replaced with "fix".

Example:
So in the above example the word "test" shows up twice.
The first part is: "Hello, this AAis just a BB test"
Rule number 3 applies and passes. Both patterns are found before the "test" and it ends with BB.

The second part is: "Hello, this AAis just a BB test string. I'm AA here to test"
Here rule number 3 applies but does not pass: the last pattern before "test" is AA.

The final result being:
"Hello, this AAis just a BB fix string. I'm AA here to test BB the regex."


Different solution:
Now, there are other ways to achieve this. For example, I could find each occurrence of "test" and loop through the string, keeping track of which pattern (if any) came last before it, then act on that. I would repeat this until every "test" has been handled, but this feels really inefficient.
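
A minimal sketch of that loop-based idea, assuming the markers are the literal strings "AA" and "BB" and that "test" is matched as a plain substring. The function name replace_by_last_marker is made up for illustration. Rather than carrying state forward, it looks backwards from each "test" for the nearest marker:

```python
def replace_by_last_marker(text, word="test", replacement="fix"):
    pieces = []
    pos = 0
    while True:
        hit = text.find(word, pos)
        if hit == -1:
            break
        # rfind returns -1 when the marker does not occur before `hit`.
        last_aa = text.rfind("AA", 0, hit)
        last_bb = text.rfind("BB", 0, hit)
        # Keep the word only when AA is the most recent marker (rules 1 and 3).
        keep = last_aa > last_bb
        pieces.append(text[pos:hit])
        pieces.append(word if keep else replacement)
        pos = hit + len(word)
    pieces.append(text[pos:])
    return "".join(pieces)


sample = "Hello, this AAis just a BB test string. I'm AA here to test BB the regex."
print(replace_by_last_marker(sample))
```

Each backwards search rescans the text before the current "test", which is the inefficiency mentioned above; for short strings that cost is unlikely to matter.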


My attempt at a regex solution
Initially, my issue was that everything was greedy. So [AA]*.*[BB]*.[^AA]+test matched everything up to the last "test" in the string, when I just wanted the match up to the first "test" and then to iterate until I got to the last one.

So, I modified it to: [AA]*?.*[BB]+?[^AA]*?test?
Based on regex documentation appending a ? makes it non-greedy.
This is almost what I want: rules 2 and 3 are covered, but it won't work for rule 1, and I'm not sure how to fix the pattern.
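
Two details of these attempts can be checked directly. The sketch below (variable names are illustrative) shows that square brackets form a character class, so [AA] matches a single "A" rather than the marker AA, and that a lazy .*? stops at the first "test" while a greedy .* runs to the last one:

```python
import re

text = "Hello, this AAis just a BB test string. I'm AA here to test BB the regex."

# [AA] is a character class: it matches one "A", exactly like [A].
class_hits = re.findall(r"[AA]", "AAB")
# Without brackets, AA matches the two-character marker.
literal_hits = re.findall(r"AA", "AAB")
print(class_hits, literal_hits)

# Greedy .* backtracks from the end, so it stops at the last "test".
greedy = re.search(r".*test", text).group()
# Lazy .*? expands from the start, so it stops at the first "test".
lazy = re.search(r".*?test", text).group()
print(greedy)
print(lazy)
```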

Also, how would I iterate my regex pattern over the entire string AND use re.sub to replace the words when needed?
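
One possible way to do both in a single pass, sketched under the assumption that the markers and the word are literal strings: re.sub accepts a function as the replacement and calls it for each match from left to right, so the function can remember the last marker seen. The names replace_after_bb and handle are hypothetical.

```python
import re


def replace_after_bb(text, word="test", replacement="fix",
                     keep_marker="AA", replace_marker="BB"):
    # Rule 4: before any marker has been seen, the word is replaced.
    replacing = True
    pattern = re.compile(
        "|".join(re.escape(p) for p in (keep_marker, replace_marker, word))
    )

    def handle(match):
        nonlocal replacing
        token = match.group()
        if token == keep_marker:
            replacing = False
        elif token == replace_marker:
            replacing = True
        elif replacing:
            return replacement
        # Markers, and words seen after AA, are returned unchanged.
        return token

    return pattern.sub(handle, text)


sample = "Hello, this AAis just a BB test string. I'm AA here to test BB the regex."
print(replace_after_bb(sample))
```

The word is matched as a substring here, so "testing" would also be affected; wrapping the word in \b anchors would restrict it to whole words if that is wanted.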


Any help is greatly appreciated.

Upvotes: 0

Views: 207

Answers (1)

CleverLikeAnOx

Reputation: 1496

I don't think trying to build a single regular expression to do everything is going to be a fruitful approach. Instead, let's use multiple regular expressions and a little programming to solve the problem:

import re

AA = "AA"  # regex for the first pattern (the question's placeholder)
BB = "BB"  # regex for the second pattern

def replace_test(string):
    # Record every position where the replace/keep state can change.
    aa_locs = [(m.start(), "aa") for m in re.finditer(AA, string)]
    bb_locs = [(m.start(), "bb") for m in re.finditer(BB, string)]
    # Sentinel tuple at the end so the final region is copied too.
    merged = sorted(aa_locs + bb_locs + [(len(string), "end")])
    start = 0
    result = ""
    replacing = True  # rule 4: no pattern seen yet, so replace
    for end, pattern_type in merged:
        # Copy the region before this pattern under the current state.
        if replacing:
            result += string[start:end].replace("test", "fix")
        else:
            result += string[start:end]
        # The pattern just reached decides the state of the next region.
        replacing = pattern_type == "bb"
        start = end
    return result

It is a bit complex and could probably be cleaned up, but let me explain what this code does. First, we make a list of every point where the state can change, so we can break the string into regions where we replace the word "test" and regions where we do not. We get a list of each place AA is found and a list of each place BB is found, stored as tuples (index, pattern). That way we know where a state change is possible. After that I merge these into a single sorted list. I also add a sentinel tuple at the end of the string, which makes sure the last region gets copied too.

The initial state is to replace, because rule 4 says "test" is replaced when no pattern comes before it, and we start at the beginning of the string. In each iteration we take the portion of the string up to the next pattern and add it to the result, replacing "test" if the current state says so. After doing this we update the state based on the pattern we have just reached: "bb" turns replacing on and "aa" turns it off.
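
To make the breakpoints concrete, here is a small sketch that builds the merged list for the example string from the question, again assuming the patterns are the literal strings "AA" and "BB". The variable names are illustrative.

```python
import re

text = "Hello, this AAis just a BB test string. I'm AA here to test BB the regex."

aa_locs = [(m.start(), "aa") for m in re.finditer("AA", text)]
bb_locs = [(m.start(), "bb") for m in re.finditer("BB", text)]
# The sentinel is a tuple like the other entries so the list sorts cleanly.
merged = sorted(aa_locs + bb_locs + [(len(text), "end")])

# Print each region together with the pattern that ends it.
start = 0
for end, kind in merged:
    print(repr(text[start:end]), "-> next pattern:", kind)
    start = end
```

Each printed region is copied under the state set by the pattern at its start, so every "test" lands in exactly one region.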

Upvotes: 1
