ScientiaEtVeritas

Reputation: 5278

How can I pass a callback to re.sub and still insert match captures?

Consider:

import random
import re

text = "abcdef"
pattern = r"(b|e)cd(b|e)"

repl = [r"\1bla\2", r"\1blabla\2"]
text = re.sub(pattern, lambda m: random.choice(repl), text)

I want to replace matches randomly with entries of a list repl. But when using lambda m: random.choice(repl) as a callback, it doesn't replace \1, \2 etc. with its captures any more, returning "\1bla\2" as plain text.

I've tried reading re.py to see how this is done internally, hoping I could call the same internal function, but it doesn't seem trivial.

The example above returns a\1bla\2f or a\1blabla\2f while abblaef or abblablaef are valid options in my case.

Note that I'm using a function, because, in case of several matches like text = "abcdef abcdef", it should randomly choose a replacement from repl for every match – instead of using the same replacement for all matches.

Upvotes: 7

Views: 1031

Answers (3)

RootTwo

Reputation: 4418

In the example, the capture groups are put back where they were, unchanged. So change the pattern to use lookahead and lookbehind assertions instead:

replacements = ['bla', 'blabla']
re.sub(r"(?<=b|e)cd(?=b|e)", lambda mo: random.choice(replacements), text)

This matches cd only if it is preceded by b or e and followed by b or e.

Alternatively, the replacement function receives a match object, so it has access to all the match groups:

re.sub(pattern, lambda mo: f"{mo[1]}{random.choice(replacements)}{mo[2]}", text)

where mo is the match object, mo[1] is the first capture group and mo[2] is the second.
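
As a rough, self-contained sketch of the two approaches side by side (the helper names with_lookarounds and with_groups are made up for illustration, and the seed is only there so both runs draw the same choices):

```python
import random
import re

text = "abcdef abcdef"
replacements = ["bla", "blabla"]

def with_lookarounds(text):
    # Only "cd" is consumed; the neighbours are asserted, not matched.
    return re.sub(r"(?<=b|e)cd(?=b|e)",
                  lambda mo: random.choice(replacements), text)

def with_groups(text):
    # The neighbours are captured and re-emitted from the match object.
    return re.sub(r"(b|e)cd(b|e)",
                  lambda mo: f"{mo[1]}{random.choice(replacements)}{mo[2]}",
                  text)

random.seed(1)
a = with_lookarounds(text)
random.seed(1)
b = with_groups(text)
print(a, "|", b)  # the two results are identical for this input
```

The two are not interchangeable on every input: in a string such as "bcdecdb" the lookaround pattern matches both cd occurrences, while the capturing pattern consumes the shared e and matches only once.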

Upvotes: 0

Nick

Reputation: 147206

One way to do this (and ensure random replacements) is to nest calls to re.sub:

text = "abcdef abcdef"
pattern = "(b|e)cd(b|e)"

repl = [r"\1bla\2", r"\1blabla\2"]
text = re.sub(pattern, lambda m: re.sub(r'\\(\d+)', lambda m1: m.group(int(m1.group(1))), random.choice(repl)), text)

print(text)

Output varies between

abblaef abblaef
abblaef abblablaef
abblablaef abblaef
abblablaef abblablaef

It turns out my nested call was basically the equivalent of m.expand, as described in Mark Meyer's answer.
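
To illustrate that equivalence on a single match, here is a small sketch comparing the hand-rolled inner re.sub with m.expand for one template (the variable names template, nested and expanded are just for this example):

```python
import re

m = re.search(r"(b|e)cd(b|e)", "abcdef")
template = r"\1bla\2"

# Hand-rolled expansion: replace each \N with the text of group N.
nested = re.sub(r"\\(\d+)", lambda m1: m.group(int(m1.group(1))), template)

# Built-in expansion, using the same rules re.sub applies to string templates.
expanded = m.expand(template)

print(nested, expanded)
```

The hand-rolled version only understands numeric backreferences, whereas expand also accepts the \g<name> and \g<1> forms, so expand is probably the safer choice if the templates might use those.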

Upvotes: 1

Mark

Reputation: 92460

If you pass a function, you lose the automatic expansion of backreferences. You just get the match object and have to do the work yourself. So you could:

Pick a replacement string up front rather than passing a function (note that the same string is then used for every match):

text = "abcdef"
pattern = "(b|e)cd(b|e)"

repl = [r"\1bla\2", r"\1blabla\2"]
re.sub(pattern, random.choice(repl), text)
# 'abblaef' or 'abblablaef'
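
One thing to keep in mind with this variant: random.choice runs once, before re.sub is called, so every match in the text gets the same replacement. A quick sketch with two matches (the names chosen, first and second are only for illustration):

```python
import random
import re

text = "abcdef abcdef"
pattern = r"(b|e)cd(b|e)"
repl = [r"\1bla\2", r"\1blabla\2"]

chosen = random.choice(repl)  # evaluated once, not once per match
result = re.sub(pattern, chosen, text)

first, second = result.split()
print(result, first == second)  # both words always receive the same replacement
```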

Or write a function that processes the match object, which allows more complex processing. You can use the match object's expand method to resolve backreferences:

text = "abcdef abcdef"
pattern = "(b|e)cd(b|e)"

def repl(m):
    templates = [r"\1bla\2", r"\1blabla\2"]
    return m.expand(random.choice(templates))


re.sub(pattern, repl, text)

# 'abblaef abblablaef' and variations

You can, of course, put that function into a lambda:

repl = [r"\1bla\2", r"\1blabla\2"]
re.sub(pattern, lambda m: m.expand(random.choice(repl)), text)
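
If this is needed in more than one place, the lambda could be wrapped in a small helper. The sketch below is only one way to do it: random_sub and its rng parameter are hypothetical names, not part of the re module, and the named groups left and right are introduced here just to show that expand also handles \g<name> references.

```python
import random
import re

def random_sub(pattern, templates, text, rng=random):
    """Replace each match with a randomly chosen, expanded template.

    random_sub and rng are illustrative names, not part of the re module.
    """
    return re.sub(pattern, lambda m: m.expand(rng.choice(templates)), text)

# Named groups work too, because expand resolves \g<name> references.
pattern = r"(?P<left>b|e)cd(?P<right>b|e)"
templates = [r"\g<left>bla\g<right>", r"\g<left>blabla\g<right>"]

rng = random.Random(0)  # seeded only to make the run repeatable
result = random_sub(pattern, templates, "abcdef abcdef", rng)
print(result)
```

Passing a random.Random instance keeps the choice reproducible in tests without touching the global random state.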

Upvotes: 8
