Reputation: 363
I am trying to write a regular expression that matches every occurrence of ABC that comes after XYZ, anywhere in the string:
Example text: "Some ABC text followed by XYZ followed by multiple ABC, more ABC, more ABC"
That is, the regex should match the three ABCs that come after XYZ, but not the one before it.
Any clues?
Upvotes: 2
Views: 2465
Reputation: 428
There is a nifty Counter object in collections which may be of help. A Counter object is a dictionary with the keys being the individual items, and the values the counts. Example:
Counter('hello there hello'.split())  # Counter({'hello': 2, 'there': 1})
Since we want to count words, we must split the phrase wherever we see whitespace. This is the default behavior of the split method. Here is an example script which uses Counter. The lower half could be adapted into a function if required.
from collections import Counter


def count_frequency(phrase):
    """Return a Counter mapping each word to its number of occurrences."""
    return Counter(phrase.split())


def replace_word(target_word, replacement, phrase):
    """Replace every whole-word occurrence of target_word in phrase."""
    words = phrase.split()
    for index, word in enumerate(words):
        if word == target_word:
            words[index] = replacement
    # Join with spaces; joining with '' would run the words together.
    return ' '.join(words)


phrase = "hello there hello hello"
word_counts = count_frequency(phrase)
replacement = 'replaced'

# Replace every word that occurs more than twice.
for word in word_counts:
    if word_counts[word] > 2:
        phrase = replace_word(word, replacement, phrase)

print(phrase)
Upvotes: 0
Reputation: 359
You could take an iterative approach:
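import re

s = "Some ABC text followed by XYZ followed by multiple ABC, more ABC, more ABC"
# Lazily match everything from just after XYZ up to the next ABC.
pattern = re.compile(r'(?<=XYZ)(.*?)ABC')
# Each pass replaces the first remaining ABC after XYZ; loop until none are left.
while pattern.search(s):
    s = pattern.sub(r'\1REPLACED', s)
print(s)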
Output:
Some ABC text followed by XYZ followed by multiple REPLACED, more REPLACED, more REPLACED
Upvotes: 1
Reputation: 11322
Something like this? r"(?<=XYZ)((?:ABC)+)". This will match only the occurrences of ABC when they directly follow XYZ, but will not include XYZ itself.
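To see what the lookbehind pattern does and does not match, here is a minimal sketch. The string adjacent is a made-up input for illustration; sample is the text from the question.

```python
import re

# Lookbehind pattern from the answer: ABC runs directly preceded by XYZ.
pattern = re.compile(r"(?<=XYZ)((?:ABC)+)")

# Hypothetical input where ABC comes immediately after XYZ.
adjacent = "XYZABCABC and later ABC"
# Text from the question, where other words sit between XYZ and ABC.
sample = "Some ABC text followed by XYZ followed by multiple ABC, more ABC, more ABC"

adjacent_hits = [m.group(1) for m in pattern.finditer(adjacent)]
sample_hits = [m.group(1) for m in pattern.finditer(sample)]

print(adjacent_hits)  # ['ABCABC']
print(sample_hits)    # []
```

The second result is empty because none of the ABCs in the question's text touch XYZ directly.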
EDIT
Looks like I misunderstood the OP's original question. The easiest way to do this would be to first find the string XYZ and save its starting position. Pass that position as the extra argument to p.finditer(string, startpos) (the re documentation calls this parameter pos). Please note that this only works with compiled regular expressions, so you need to compile your pattern first.
The pattern you need is simply r"(ABC)".
Alternatively, you can use p.sub(), which will also do the substitution, but for this to work on only part of the string you will need to create a substring first, because p.sub() does not have a startpos parameter.
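The two steps described above could be sketched roughly as follows. This is only an illustration: the variable names (marker, cut, replaced) and the replacement string REPLACED are assumptions, and the search starts just past XYZ rather than at its first character, which gives the same matches here.

```python
import re

text = "Some ABC text followed by XYZ followed by multiple ABC, more ABC, more ABC"

# finditer's start position is only available on compiled patterns.
pattern = re.compile(r"ABC")

# Locate the marker; str.find returns -1 when XYZ is absent.
marker = text.find("XYZ")

if marker == -1:
    matches = []
    replaced = text
else:
    cut = marker + len("XYZ")
    # Variant 1: iterate over matches, starting the scan after XYZ.
    matches = list(pattern.finditer(text, cut))
    # Variant 2: sub() has no start position, so substitute on a slice
    # and glue the untouched prefix back on.
    replaced = text[:cut] + pattern.sub("REPLACED", text[cut:])

print(len(matches))  # 3
print(replaced)
```

Slicing keeps the ABC that appears before XYZ untouched, which a plain pattern.sub() on the whole string would not.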
Upvotes: 1
Reputation: 1122302
Just match the literal XYZ and group on the repeated ABC:
r'XYZ((?:ABC)+)'
The (?:ABC)+ pattern matches the literal sequence ABC one or more times, and the whole group is preceded by a literal XYZ.
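As a quick check of how this pattern behaves, here is a small sketch. The input string packed is made up for illustration; sample is the text from the question.

```python
import re

pattern = re.compile(r"XYZ((?:ABC)+)")

# Hypothetical input with three ABCs directly after XYZ.
packed = "prefix XYZABCABCABC suffix"
m = pattern.search(packed)

whole = m.group(0)               # full match, including XYZ
run = m.group(1)                 # just the captured ABC run
count = len(run) // len("ABC")   # number of repetitions in the run

# Text from the question: XYZ is followed by other words, not by ABC.
sample = "Some ABC text followed by XYZ followed by multiple ABC, more ABC, more ABC"
no_match = pattern.search(sample)

print(whole, run, count)  # XYZABCABCABC ABCABCABC 3
print(no_match)           # None
```

Note that the pattern requires ABC to start right after XYZ, so it finds nothing in the question's example text.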
This is quite basic, regular expressions 101; you should read a good tutorial on regular expression matching to get started.
Upvotes: 1