Reputation: 11720
I am trying to replace specific words (delimited by a specific separator, here an underscore) in Python. I know I can just split the string, loop through the list items to replace the key phrases, and then re-join them into a string afterwards, but that does not feel like the most elegant or the best way to go about it.
So, let's say I create a small regex-based function for what I want to do, something like this:
import re

def replace_words(string, rep_dict, separator):
    regex = r'({0}|\b)({1})({2}|\b)'.format(re.escape(separator), '|'.join(rep_dict.keys()), re.escape(separator))
    return re.sub(regex, lambda x: '{0}{1}{2}'.format(x.group(1), rep_dict[x.group(2)], x.group(3)), string)
Suppose I use some other separator, for example an asterisk (*); then it works as intended:
rep_odds = {'first': '1st', 'third': '3rd', 'fifth': '5th'}
rep_evens = {'second': '2nd', 'fourth': '4th'}
orders = ['first', 'second', 'third', 'fourth', 'fifth']
before = '*'.join(orders)
after = replace_words(before, rep_odds, separator='*')
# returns: 1st*second*3rd*fourth*5th
after = replace_words(before, {**rep_odds, **rep_evens}, separator='*')
# returns: 1st*2nd*3rd*4th*5th
But if I change the separator to an underscore (_) instead, I get this unexpected (and wrong) behavior:
before = '_'.join(orders)
after = replace_words(before, rep_odds, separator='_')
# returns: 1st_second_3rd_fourth_5th <-- Good
after = replace_words(before, {**rep_odds, **rep_evens}, separator='_')
# returns: 1st_second_3rd_fourth_5th <-- What went wrong ?!
Can someone help me understand this behavior? I am still fairly new to regex and how it works in Python. Thanks.
Upvotes: 2
Views: 1338
Reputation: 13999
Neat problem. Here's the issue:
re.sub does not allow one character to belong to more than one match; once a character belongs to a match, it is consumed, unless you specify that the match be nonconsuming. When matching using the asterisk, a key fact was that a word boundary lies between an asterisk and a word character. Here are the matching groups when using the asterisk (the {0}, {1}, and {2} in the lambda):
('', 'first', '*')
('', 'second', '*')
('', 'third', '*')
('', 'fourth', '*')
('', 'fifth', '')
When the regex matcher hits the end of the first match, its cursor is between the first asterisk and the word second, which is at a word boundary. Hence second* is also a match, and then third*, etc.
However, when you use an underscore, here are the corresponding matches:
('', 'first', '_')
('_', 'third', '_')
('_', 'fifth', '')
When the regex matcher hits the end of the first match, its cursor is between the first underscore and the word second, which is not a word boundary, because the underscore itself counts as a word character (\w). Since it's already passed the first underscore and isn't at a word boundary, it can't match (_|\b)second. Hence there is no match until the next underscore after second, and you can see that that match includes both underscores adjacent to third.
In short, the first example was "lucky" in that after passing the separator character, you'd land in a word boundary, which was not the case for the second example.
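To see this for yourself, here is a small sketch that lists the groups of every match the question's regex finds. The helper name show_groups is made up for illustration; the pattern is the one from the question.

```python
import re

def show_groups(string, keys, separator):
    """Return the (group 1, group 2, group 3) tuples found by the question's regex."""
    sep = re.escape(separator)
    regex = r'({0}|\b)({1})({0}|\b)'.format(sep, '|'.join(keys))
    return [m.groups() for m in re.finditer(regex, string)]

orders = ['first', 'second', 'third', 'fourth', 'fifth']

star_groups = show_groups('*'.join(orders), orders, '*')
underscore_groups = show_groups('_'.join(orders), orders, '_')

print(star_groups)        # five matches, one per word
print(underscore_groups)  # only three matches: first, third, fifth

# \b needs a word character on exactly one side, and '_' is a word character.
print(bool(re.search(r'\*\bsecond', '*second')))  # True
print(bool(re.search(r'_\bsecond', '_second')))   # False
```

The last two lines isolate the root cause: there is a boundary between * and s, but none between _ and s.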
To fix this, you can use a lookahead assertion, which will not consume the matched characters.
def replace_words(string, rep_dict, separator):
    regex = r'({0}|\b)({1})((?={2}|\b).*?)'.format(
        re.escape(separator), '|'.join(rep_dict.keys()), re.escape(separator)
    )
    return re.sub(
        regex,
        lambda x: '{0}{1}{2}'.format(x.group(1), rep_dict[x.group(2)], x.group(3)),
        string
    )
The matches are now the following:
('', 'first', '')
('*', 'second', '')
('*', 'third', '')
('*', 'fourth', '')
('*', 'fifth', '')
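As a quick check, here is a self-contained sketch of the lookahead version run against the underscore case from the question. The merged dictionary rep and the variable names are only for illustration.

```python
import re

def replace_words(string, rep_dict, separator):
    sep = re.escape(separator)
    # Group 3 is always empty: the lookahead only checks that the separator
    # (or a word boundary) follows, without consuming it.
    regex = r'({0}|\b)({1})((?={0}|\b).*?)'.format(sep, '|'.join(rep_dict))
    return re.sub(
        regex,
        lambda m: m.group(1) + rep_dict[m.group(2)] + m.group(3),
        string,
    )

rep = {'first': '1st', 'second': '2nd', 'third': '3rd',
       'fourth': '4th', 'fifth': '5th'}

after = replace_words('first_second_third_fourth_fifth', rep, '_')
print(after)  # 1st_2nd_3rd_4th_5th

# A keyword that is only a prefix of a longer word is left alone.
untouched = replace_words('*firstperson*', {'first': '1st'}, '*')
print(untouched)  # *firstperson*
```

Because the trailing separator is no longer consumed, it is still available to serve as group 1 of the next match.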
Ignore the struck-through text below, which would have matched on word prefixes, e.g. *firstperson* would have become *1stperson*.
P.S. Splitting and rejoining is probably your best bet. This is most likely what re.sub is doing under the hood anyway, since strings are immutable.
To fix this, you can match only on the separator character preceding a keyword or the start of the string (alternatively, the separator character following a keyword or the end of the string).
def replace_words(string, rep_dict, separator):
    regex = r'(^|{0})({1})'.format(
        re.escape(separator), '|'.join(rep_dict.keys())
    )
    return re.sub(
        regex,
        # print() returns None, so `or` falls through to the replacement string
        lambda x: print(x.groups()) or '{0}{1}'.format(x.group(1), rep_dict[x.group(2)]),
        string
    )
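If you want a regex that treats only the separator (or the start/end of the string) as a delimiter and never matches a word prefix, one possible sketch uses a lookbehind and a lookahead so that no separator is consumed at all. The name replace_words_strict is made up for illustration, and it assumes the separator is a fixed string, since Python's re requires fixed-width lookbehinds.

```python
import re

def replace_words_strict(string, rep_dict, separator):
    if not rep_dict:
        return string  # nothing to replace; avoids building an empty alternation
    sep = re.escape(separator)
    # Escape the keys too, in case they contain regex metacharacters.
    keys = '|'.join(re.escape(key) for key in rep_dict)
    # A keyword must start at the string start or right after a separator,
    # and must be followed by a separator or the string end.
    pattern = r'(?:\A|(?<={0}))(?:{1})(?={0}|\Z)'.format(sep, keys)
    return re.sub(pattern, lambda m: rep_dict[m.group(0)], string)

rep = {'first': '1st', 'second': '2nd', 'third': '3rd',
       'fourth': '4th', 'fifth': '5th'}

print(replace_words_strict('first_second_third_fourth_fifth', rep, '_'))
# 1st_2nd_3rd_4th_5th
print(replace_words_strict('firstperson_first', {'first': '1st'}, '_'))
# firstperson_1st
```

Unlike the \b-based versions, this one ignores word boundaries entirely, so only the separator you pass in delimits words.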
Upvotes: 3
Reputation: 1528
Here is another way:
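# Merge both dicts (Python 3.5+); dict(a.items() + b.items()) only works in Python 2
z = {**rep_odds, **rep_evens}
for i, item in enumerate(orders):
    # fall back to the original word when there is no replacement
    orders[i] = z.get(item, item)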
Output:
print(orders)
['1st', '2nd', '3rd', '4th', '5th']
Upvotes: 0
Reputation: 26057
You don't need a regex, if I understand you properly.
rep_odds = {'first': '1st', 'third': '3rd', 'fifth': '5th'}
rep_evens = {'second': '2nd', 'fourth': '4th'}
orders = ['first', 'second', 'third', 'fourth', 'fifth']
# Fall back to the word itself; a default of 0 would make join() raise a TypeError
print('_'.join([rep_odds.get(x, rep_evens.get(x, x)) for x in orders]))
# 1st_2nd_3rd_4th_5th
You could generalize this to work with any separator or any ordering of orders, like:
def fun(sep, orders):
    # Fall back to the word itself; a default of 0 would make join() raise a TypeError
    print(sep.join([rep_odds.get(x, rep_evens.get(x, x)) for x in orders]))
fun('*', orders)
# 1st*2nd*3rd*4th*5th
Upvotes: 0
Reputation: 6915
It doesn't directly answer your question, but isn't this:
def replace_words(s, rep_dict, sep):
    return sep.join(rep_dict.get(word, word) for word in s.split(sep))
more elegant than the regex solution (which uses join anyway)?
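For completeness, here is a short usage sketch of the split/join version on the inputs from the question. The merged dictionary rep is only for illustration.

```python
def replace_words(s, rep_dict, sep):
    # Words missing from rep_dict fall back to themselves.
    return sep.join(rep_dict.get(word, word) for word in s.split(sep))

rep_odds = {'first': '1st', 'third': '3rd', 'fifth': '5th'}
rep_evens = {'second': '2nd', 'fourth': '4th'}
rep = {**rep_odds, **rep_evens}

before = 'first_second_third_fourth_fifth'
print(replace_words(before, rep_odds, '_'))  # 1st_second_3rd_fourth_5th
print(replace_words(before, rep, '_'))       # 1st_2nd_3rd_4th_5th

# Only whole words between separators are replaced, never prefixes.
print(replace_words('firstperson_first', rep, '_'))  # firstperson_1st
```

Since each piece between separators is looked up as a whole, the underscore and word-boundary subtleties never come into play.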
Upvotes: 0