Adam

Reputation: 558

re.sub (python) substitute part of the matched string

I have a series of strings that can be identified by a "p" tag followed by at least two capital letters.

Input:

<p>JIM <p>SALLY <p>ROBERT <p>Eric

I want to change the "p" tag to an "i" tag if it's followed by those two capital letters (so not the last one, 'Eric').

Desired output:

<i>JIM <i>SALLY <i>ROBERT <p>Eric

I've tried this using regular expressions in Python:

import re
Mytext = "<p>JIM <p>SALLY <p>ROBERT <p>Eric"

changeTags = re.sub('<p>[A-Z]{2}', '<i>' + re.search('<p>[A-Z]{2}', Mytext).group()[-2:], Mytext)
print changeTags

But the output uses the "i" tag + JI in every instance, rather than iterating through to use SA and then RO in entries 2 and 3.

<i>JIM <i>JILLY <i>JIBERT <p>Eric

I believe the problem is that I don't understand the .group() method properly. Can anyone advise what I've done wrong?

Thank you.

Upvotes: 1

Views: 425

Answers (2)

Juan Diego Godoy Robles

Reputation: 14945

Another way, using a look-ahead assertion:

re.sub(r'<p>(?=[A-Z]{2,})', '<i>', Mytext)
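
For completeness, here is a minimal runnable sketch of the look-ahead approach applied to the input from the question. The variable name `lookahead_result` is only illustrative. The look-ahead checks that two capitals follow the tag without consuming them, so only the tag itself is replaced and no backreference is needed.

```python
import re

Mytext = "<p>JIM <p>SALLY <p>ROBERT <p>Eric"

# (?=[A-Z]{2}) asserts that two capital letters follow "<p>",
# but they are not part of the match, so they are left untouched.
lookahead_result = re.sub(r'<p>(?=[A-Z]{2})', '<i>', Mytext)
print(lookahead_result)  # <i>JIM <i>SALLY <i>ROBERT <p>Eric
```

Since a look-ahead only needs to confirm that two capitals are present, `{2}` and `{2,}` should behave the same here.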

Upvotes: 1

shx2

Reputation: 64308

Your inner re.search is only evaluated once, and its result is passed as one of the arguments to re.sub. It can't possibly capture every pair of capital letters, only the first one. So the problem is the approach itself, not just your understanding of .group().

Furthermore, the inner re.search is unnecessary.
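
To make the "evaluated once" point concrete, the sketch below splits the question's one-liner into two steps. The names `replacement` and `broken_result` are only illustrative. By the time re.sub runs, the replacement is already a fixed string.

```python
import re

Mytext = "<p>JIM <p>SALLY <p>ROBERT <p>Eric"

# This expression is evaluated before re.sub is called,
# so re.sub only ever sees the resulting plain string.
replacement = '<i>' + re.search('<p>[A-Z]{2}', Mytext).group()[-2:]
print(replacement)  # <i>JI

# Every match is replaced by that same fixed string.
broken_result = re.sub('<p>[A-Z]{2}', replacement, Mytext)
print(broken_result)  # <i>JIM <i>JILLY <i>JIBERT <p>Eric
```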


You need to capture the capital letters using parentheses, and reference them as \1 in the substitution string:

re.sub('<p>([A-Z]{2})', r'<i>\1', Mytext)

\1 here means: replace with the substring matched by the first (...) in the regular expression. (docs)

Note the leading r in front of the substitution string, to make it raw.
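
As a quick check, here is a self-contained sketch of the backreference version run against the input from the question. The name `backref_result` is only illustrative. The last two lines show why the raw prefix matters.

```python
import re

Mytext = "<p>JIM <p>SALLY <p>ROBERT <p>Eric"

# ([A-Z]{2}) captures the two capitals; \1 puts them back after the new tag.
backref_result = re.sub(r'<p>([A-Z]{2})', r'<i>\1', Mytext)
print(backref_result)  # <i>JIM <i>SALLY <i>ROBERT <p>Eric

# Without the r prefix, '\1' is an octal escape for the character chr(1),
# so re.sub would never see a backreference.
print('\1' == chr(1))    # True
print(r'\1' == '\\1')    # True
```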

Upvotes: 0
