roopesh

Reputation: 363

Regex for matching any occurrence of ABC following XYZ anywhere in the string

I am trying to write a regular expression that matches every occurrence of ABC that comes after XYZ, anywhere in the string:

Example text: "Some ABC text followed by XYZ followed by multiple ABC, more ABC, more ABC"

i.e., the regex should match the three occurrences of ABC that come after XYZ, but not the one before it.

Any clues?

Upvotes: 2

Views: 2465

Answers (4)

Ethan Henderson

Reputation: 428

There is a nifty Counter object in collections which may be of help. A Counter object is a dictionary with the keys being the individual items, and the values the counts. Example:

Counter('hello there hello'.split())  # Counter({'hello': 2, 'there': 1})

Since we want to count words, we must split the phrase wherever we see whitespace. This is the default behavior of the split method. Here is an example script which uses Counter. The lower half could be adapted into a function if required.

from collections import Counter

def count_frequency(phrase):
    """Return a Counter mapping each word to its number of occurrences."""
    counts = Counter(phrase.split())
    return counts

def replace_word(target_word, replacement, phrase):
    """Replace each whole word equal to *target_word* with *replacement* in *phrase*."""
    words = phrase.split()

    for index, word in enumerate(words):
        if word == target_word:
            words[index] = replacement

    # Join with a space so the words do not run together
    return ' '.join(words)

phrase = "hello there hello hello"
word_counts = count_frequency(phrase)
replacement = 'replaced'

# Replace every word that occurs more than twice
for word in word_counts:
    if word_counts[word] > 2:
        phrase = replace_word(word, replacement, phrase)

print(phrase)

Upvotes: 0

tojrobinson

Reputation: 359

You could take an iterative approach:

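import re

s = "Some ABC text followed by XYZ followed by multiple ABC, more ABC, more ABC"

# The lookbehind anchors each match right after XYZ; the lazy group
# captures everything up to the next ABC so it can be put back.
pattern = re.compile(r'(?<=XYZ)(.*?)ABC')
while pattern.search(s):
    s = pattern.sub(r'\1REPLACED', s)

print(s)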

Output:

Some ABC text followed by XYZ followed by multiple REPLACED, more REPLACED, more REPLACED
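
To make it clearer why the loop is needed, here is a small sketch that counts the passes. Because a match can only start directly after XYZ, each call to sub replaces just one ABC, so the loop runs once per occurrence. The passes counter and the use of subn are additions for illustration, not part of the snippet above. Note that this approach would loop forever if the replacement text itself contained ABC.

```python
import re

s = "Some ABC text followed by XYZ followed by multiple ABC, more ABC, more ABC"
pattern = re.compile(r'(?<=XYZ)(.*?)ABC')

passes = 0  # illustrative counter, not in the original snippet
while True:
    # subn returns the new string and the number of substitutions made
    s, n = pattern.subn(r'\1REPLACED', s)
    if n == 0:
        break
    passes += 1

print(passes)
print(s)
```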

Upvotes: 1

Hans Then

Reputation: 11322

Something like this? r"(?<=XYZ)((?:ABC)+)". This will match only the occurrences of ABC that immediately follow XYZ, and will not include XYZ itself.

EDIT

Looks like I misunderstood OP's original question. The easiest way to do this would be to first find the string XYZ and save its starting position. Then pass that position as the extra argument to p.finditer(string, startpos). Please note that this only works with compiled regular expressions (the module-level re.finditer() does not accept a start position), so you need to compile your pattern first.

The pattern you need is simply r"(ABC)".

Alternatively, you can use p.sub(), which will also do the substitution, but for this to work on only a part of the string, you will need to create a substring first. p.sub() does not have a startpos parameter.
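
A minimal sketch of both suggestions, using the string from the question. The variable names (start, matches, result) are made up for illustration, and the sketch assumes that only the first XYZ matters.

```python
import re

s = "Some ABC text followed by XYZ followed by multiple ABC, more ABC, more ABC"
p = re.compile(r"(ABC)")

start = s.find("XYZ")  # index of the first XYZ, or -1 if there is none

if start == -1:
    matches = []
    result = s
else:
    # finditer on a compiled pattern takes a start position as second argument
    matches = [m.span() for m in p.finditer(s, start)]
    # sub has no start position, so substitute in the tail only and re-attach the head
    result = s[:start] + p.sub("REPLACED", s[start:])

print(matches)
print(result)
```

The ABC that appears before XYZ is left alone in both cases, because neither the search nor the substitution ever looks at the part of the string before start.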

Upvotes: 1

Martijn Pieters

Reputation: 1122302

Just match the literal XYZ and group on the repeated ABC:

r'XYZ((?:ABC)+)'

The (?:ABC)+ pattern matches the literal characters ABC one or more times, and the whole group must be immediately preceded by a literal XYZ.

This is quite basic, regular expressions 101 material; you should read a good tutorial on regular expression matching to get started.
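
As a quick check of what this pattern captures, here is a small sketch. The first sample string is made up for illustration; the second is the text from the question. Since the pattern requires ABC to sit directly after XYZ, it finds nothing in the question's text, where other words come between them.

```python
import re

pattern = re.compile(r'XYZ((?:ABC)+)')

# Hypothetical input where ABC directly follows XYZ
adjacent = pattern.search("start XYZABCABC end")
adjacent_run = adjacent.group(1) if adjacent else None

# The text from the question: XYZ is followed by other words first
question_text = "Some ABC text followed by XYZ followed by multiple ABC, more ABC, more ABC"
separated = pattern.search(question_text)

print(adjacent_run)
print(separated)
```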

Upvotes: 1
