Wizard.Ritvik

Reputation: 11720

Regex to replace words separated by underscore in Python

I am trying to replace specific words (delimited by a specific separator, here an underscore) in Python. I know I can split the string, loop through the list items to replace the key phrases, and then re-join them into a string afterwards, but that does not feel like the most elegant or the best way to go about it.

So, let's say I create a small regex-based function for what I want to do, something like this:

import re

def replace_words(string, rep_dict, separator):
    regex = r'({0}|\b)({1})({2}|\b)'.format(re.escape(separator), '|'.join(rep_dict.keys()), re.escape(separator))
    return re.sub(regex, lambda x: '{0}{1}{2}'.format(x.group(1), rep_dict[x.group(2)], x.group(3)), string)

If I use any other separator, for example an asterisk (*), it works as intended:

rep_odds = {'first': '1st', 'third': '3rd', 'fifth': '5th'}
rep_evens = {'second': '2nd', 'fourth': '4th'}
orders = ['first', 'second', 'third', 'fourth', 'fifth']
before = '*'.join(orders)
after = replace_words(before, rep_odds, separator='*')
# returns: 1st*second*3rd*fourth*5th
after = replace_words(before, {**rep_odds, **rep_evens}, separator='*')
# returns: 1st*2nd*3rd*4th*5th

But if I change the separator to an underscore (_) instead, I get this unexpected (and wrong) behavior:

before = '_'.join(orders)
after = replace_words(before, rep_odds, separator='_')
# returns: 1st_second_3rd_fourth_5th <-- Good
after = replace_words(before, {**rep_odds, **rep_evens}, separator='_')
# returns: 1st_second_3rd_fourth_5th <-- What went wrong ?!

Can someone help me understand this behavior? I am still fairly new to regex and how it works in Python. Thanks.

Upvotes: 2

Views: 1338

Answers (4)

BallpointBen

Reputation: 13999

Neat problem. Here's the issue:

re.sub only finds non-overlapping matches: once a character belongs to one match, it is consumed and cannot be part of the next match, unless the pattern matches it with a non-consuming assertion. When matching using the asterisk, a key fact was that a word boundary lies between an asterisk and a word character. Here are the groups of each match when using the asterisk (x.group(1), x.group(2), and x.group(3) in the lambda, which fill {0}, {1}, and {2}):

('', 'first', '*')
('', 'second', '*')
('', 'third', '*')
('', 'fourth', '*')
('', 'fifth', '')

When the regex matcher hits the end of the first match, its cursor is between the first asterisk and the word second, which is at a word boundary. Hence second* is also a match, and then third*, etc.

However, when you use an underscore, here are the corresponding matches:

('', 'first', '_')
('_', 'third', '_')
('_', 'fifth', '')

When the regex matcher hits the end of the first match, its cursor is between the first underscore and the word second, which is not a word boundary, because the underscore itself counts as a word character (\w). Since it has already passed the first underscore and isn't at a word boundary, it can't match (_|\b)second. Hence there is no match until the next underscore after second, and you can see that that match includes both underscores adjacent to third.

In short, the first example was "lucky" in that after passing the separator character, you'd land in a word boundary, which was not the case for the second example.
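
The two lists of groups above can be reproduced with re.finditer, which finds the same non-overlapping matches that re.sub replaces. This is only a small sketch: match_groups is a hypothetical helper name, and the pattern is the one from the question.

```python
import re

def match_groups(text, keys, separator):
    # Same pattern shape as the question's function: (sep|\b)(word)(sep|\b)
    sep = re.escape(separator)
    pattern = r'({0}|\b)({1})({0}|\b)'.format(sep, '|'.join(keys))
    return [m.groups() for m in re.finditer(pattern, text)]

orders = ['first', 'second', 'third', 'fourth', 'fifth']

print(match_groups('*'.join(orders), orders, '*'))
# [('', 'first', '*'), ('', 'second', '*'), ('', 'third', '*'), ('', 'fourth', '*'), ('', 'fifth', '')]

print(match_groups('_'.join(orders), orders, '_'))
# [('', 'first', '_'), ('_', 'third', '_'), ('_', 'fifth', '')]
```

With the underscore, second and fourth never show up because the separator in front of each was already consumed by the previous match.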

To fix this, you can use a lookahead assertion, which will not consume the matched characters.

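def replace_words(string, rep_dict, separator):
    # The lookahead (?=...) checks for a separator or word boundary without
    # consuming it, so the same separator can still start the next match.
    regex = r'({0}|\b)({1})((?={2}|\b).*?)'.format(
        re.escape(separator), '|'.join(rep_dict.keys()), re.escape(separator)
    )

    return re.sub(
        regex,
        lambda x: '{0}{1}{2}'.format(x.group(1), rep_dict[x.group(2)], x.group(3)),
        string,
    )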

With the asterisk separator, the groups of each match are now the following:

('', 'first', '')
('*', 'second', '')
('*', 'third', '')
('*', 'fourth', '')
('*', 'fifth', '')
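
As a quick check of the lookahead idea on both separators, here is a minimal sketch. It is a variant, not the exact function above: the name replace_words_lookahead is made up, the keys are passed through re.escape, and the trailing capture group is dropped because the lookahead consumes nothing.

```python
import re

def replace_words_lookahead(string, rep_dict, separator):
    sep = re.escape(separator)
    keys = '|'.join(re.escape(k) for k in rep_dict)
    # (?=...) only peeks at the trailing separator or boundary,
    # leaving it available as the start of the next match
    pattern = r'({0}|\b)({1})(?={0}|\b)'.format(sep, keys)
    return re.sub(pattern, lambda m: m.group(1) + rep_dict[m.group(2)], string)

reps = {'first': '1st', 'second': '2nd', 'third': '3rd', 'fourth': '4th', 'fifth': '5th'}
orders = ['first', 'second', 'third', 'fourth', 'fifth']

print(replace_words_lookahead('_'.join(orders), reps, '_'))  # 1st_2nd_3rd_4th_5th
print(replace_words_lookahead('*'.join(orders), reps, '*'))  # 1st*2nd*3rd*4th*5th
print(replace_words_lookahead('*firstperson*', reps, '*'))   # *firstperson*
```

The last call shows that a keyword appearing only as a prefix of a longer word is left alone.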

Ignore the approach at the end of this answer (struck through in the original post), which would have matched on word prefixes, e.g. *firstperson* would have become *1stperson*.

P.S. Splitting and rejoining is probably your best bet. re.sub also has to build a new string out of pieces anyway, since strings are immutable.

Superseded approach: to fix this, you can match only on the separator character preceding a keyword or the start of the string (alternatively, the separator character following a keyword or the end of the string).

def replace_words(string, rep_dict, separator):
    # Only checks what comes BEFORE the keyword, so it also matches word prefixes
    regex = r'(^|{0})({1})'.format(
        re.escape(separator), '|'.join(rep_dict.keys())
    )

    return re.sub(
        regex,
        lambda x: '{0}{1}'.format(x.group(1), rep_dict[x.group(2)]),
        string,
    )
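
The prefix problem with this separator-before-keyword approach is easy to reproduce. The sketch below uses the same pattern shape under a hypothetical name, replace_after_separator.

```python
import re

def replace_after_separator(string, rep_dict, separator):
    # Requires a separator (or start of string) before the keyword,
    # but places no constraint on what follows it
    pattern = r'(^|{0})({1})'.format(re.escape(separator), '|'.join(rep_dict))
    return re.sub(pattern, lambda m: m.group(1) + rep_dict[m.group(2)], string)

print(replace_after_separator('*firstperson*', {'first': '1st'}, '*'))
# *1stperson*
```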

Upvotes: 3

1pluszara

Reputation: 1528

Here is another way:

z = {**rep_odds, **rep_evens}  # merge both dicts (Python 3.5+)
for i, item in enumerate(orders):
    orders[i] = z.get(item, item)  # leave unknown words unchanged

Output:

print(orders)
['1st', '2nd', '3rd', '4th', '5th'] 

Upvotes: 0

Austin

Reputation: 26057

You don't need a regex if I understand you properly.

rep_odds = {'first': '1st', 'third': '3rd', 'fifth': '5th'}
rep_evens = {'second': '2nd', 'fourth': '4th'}
orders = ['first', 'second', 'third', 'fourth', 'fifth']

print('_'.join([rep_odds.get(x, rep_evens.get(x, x)) for x in orders]))
# 1st_2nd_3rd_4th_5th

You could generalize this to work with any separator and any list of words, like:

def fun(sep, orders):
    # Fall back to the word itself so join() never receives a non-string
    print(sep.join([rep_odds.get(x, rep_evens.get(x, x)) for x in orders]))

fun('*', orders)
# 1st*2nd*3rd*4th*5th

Upvotes: 0

taras

Reputation: 6915

It doesn't directly answer your question, but isn't this:

def replace_words(s, rep_dict, sep):
    return sep.join(rep_dict.get(word, word) for word in s.split(sep))

more elegant than the regex solution (which anyway uses join)?
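
For illustration, here is that split-and-join version run on the question's underscore data. The last input, 'firstperson_first', is made up to show that only whole tokens are replaced.

```python
def replace_words(s, rep_dict, sep):
    # Words missing from rep_dict are passed through unchanged
    return sep.join(rep_dict.get(word, word) for word in s.split(sep))

rep_odds = {'first': '1st', 'third': '3rd', 'fifth': '5th'}
rep_evens = {'second': '2nd', 'fourth': '4th'}
before = 'first_second_third_fourth_fifth'

print(replace_words(before, rep_odds, '_'))                   # 1st_second_3rd_fourth_5th
print(replace_words(before, {**rep_odds, **rep_evens}, '_'))  # 1st_2nd_3rd_4th_5th
print(replace_words('firstperson_first', rep_odds, '_'))      # firstperson_1st
```

Because each token is looked up as a whole, the underscore needs no special treatment here.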

Upvotes: 0
