Reputation: 7338
I have multiple (>30) compiled regexes:
regex_1 = re.compile(...)
regex_2 = re.compile(...)
# ... define multiple regexes
regex_n = re.compile(...)
I then have a function which takes a text and replaces some of its words using every one of the regexes above and the re.sub method, as follows:
def sub_func(text):
    text = re.sub(regex_1, "string_1", text)
    # multiple substitutions using all regexes ...
    text = re.sub(regex_n, "string_n", text)
    return text
Question: Is there a more efficient way to make these replacements?
The regexes cannot be generalized or simplified from their current form.
I feel like reassigning the value of text
each time for every regex is quite slow, given that the function only replaces a word or two from the entirety of text
for each reassignment. Also, given that I have to do this for multiple documents, that slows things down even more.
Thanks in advance!
Upvotes: 1
Views: 1550
Reputation: 3555
We can pass a function as the repl argument of re.sub.
To keep the example easy to follow, I'll use only 3 regexes. Assume regex_1, regex_2 and regex_3 are 111, 222 and 333 respectively. regex_replace is then the list of replacement strings, in the same order as regex_1, regex_2 and regex_3.
I'm not sure how much this will improve the runtime, though, so give it a try.
import re
regex_x = re.compile('(111)|(222)|(333)')
regex_replace = ['one', 'two', 'three']
def sub_func(text):
    # lastindex is the number of the group that matched; the list is 0-based
    return re.sub(regex_x, lambda x: regex_replace[x.lastindex - 1], text)
>>> sub_func('testing 111 222 333')
'testing one two three'
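As a rough sketch of how the same idea might scale to many regexes, the combined pattern can be built from a list of pairs, using named groups and match.lastgroup instead of counting indexes. The RULES list, the group names r0, r1, ... and the function name sub_all are made up for illustration, and the sketch assumes the individual patterns contain no capturing groups of their own.

```python
import re

# Hypothetical (pattern, replacement) pairs standing in for the real regexes.
RULES = [
    (r"\bcat\b", "dog"),
    (r"\d+", "<num>"),
    (r"colou?r", "hue"),
]

# Wrap each pattern in a named group (r0, r1, ...) and join them with "|".
COMBINED = re.compile(
    "|".join(f"(?P<r{i}>{pattern})" for i, (pattern, _) in enumerate(RULES))
)
# Map each group name back to its replacement string.
REPLACEMENTS = {f"r{i}": repl for i, (_, repl) in enumerate(RULES)}

def sub_all(text):
    # match.lastgroup is the name of the alternative that matched.
    return COMBINED.sub(lambda m: REPLACEMENTS[m.lastgroup], text)

print(sub_all("my cat has 3 colours"))
```

Alternatives are tried left to right at each position, so the order of RULES matters when two patterns could match at the same place.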
Upvotes: 1
Reputation: 40801
Reassigning a value takes constant time in Python. Unlike in languages like C, variables are more of a "name tag". So, changing what the name tag points to takes very little time.
If the replacements are constant strings, I would collect them into a tuple of (regex, replacement) pairs:
regexes = (
    (regex_1, 'string_1'),
    (regex_2, 'string_2'),
    (regex_3, 'string_3'),
    # ...
)
And then in your function, just iterate over the tuple:
def sub_func_2(text):
    for regex, sub in regexes:
        text = re.sub(regex, sub, text)
    return text
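Since the question is about speed, it may be worth measuring before restructuring anything. Below is a minimal timing sketch with timeit; the two patterns, the sample text and the name sub_loop are placeholders rather than the asker's real data, and it calls the compiled pattern's own sub method, which gives the same result as re.sub(regex, sub, text).

```python
import re
import timeit

# Hypothetical stand-ins for the real patterns and replacement strings.
regexes = (
    (re.compile(r"\bfoo\b"), "bar"),
    (re.compile(r"\d{4}"), "YEAR"),
)

def sub_loop(text):
    for regex, sub in regexes:
        # Pattern.sub on the compiled object; equivalent to re.sub(regex, sub, text)
        text = regex.sub(sub, text)
    return text

# Made-up sample document, repeated to get a measurable running time.
sample = "foo was born in 1999. " * 1000

elapsed = timeit.timeit(lambda: sub_loop(sample), number=20)
print(f"20 runs: {elapsed:.4f} s")
```

Running the same harness against a merged-regex version on real documents should show whether the single pass actually pays off.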
But if your regexes are actually named regex_1, regex_2, etc., they probably should be directly defined in a list of some sort.
Also note, if you are doing replacements like 'cat' -> 'dog', the str.replace() method might be easier (text = text.replace('cat', 'dog')), and it will probably be faster.
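For the purely literal cases, a sketch of the str.replace() route could look like the following. The LITERALS mapping and the function name replace_literals are invented for the example.

```python
# Hypothetical literal replacements; no regex syntax is involved.
LITERALS = {
    "cat": "dog",
    "red": "blue",
}

def replace_literals(text):
    # Apply each literal replacement in turn.
    for old, new in LITERALS.items():
        text = text.replace(old, new)
    return text

print(replace_literals("a red cat"))
```

Note that str.replace() matches substrings anywhere, so "concatenate" would also be changed; a regex with \b is still needed when only whole words should be replaced.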
If your strings are very long, rebuilding them from scratch for every regex might take a long time. An implementation of @Oliver Charlesworth's method that was mentioned in the comments could be:
import re
# Instead of this:
regexes = (
    (r'1(1)', r'\1i'),
    (r'2(2)(2)', r'\1a\2'),
    (r'(3)(3)3', r'\1a\2'),
)
# Merge the regexes, wrapping each alternative in its own outer group:
regex = re.compile(r'(1(1))|(2(2)(2))|((3)(3)3)')
# str.format templates: {0} is the whole alternative, {1}, {2}, ... are its inner groups
substitutions = (
    '{1}i', '{1}a{2}', '{1}a{2}'
)
# Keep track of how many inner groups are in each alternative
group_nos = (1, 2, 2)
# cumulative[k] is the group number of the outer group of alternative k
cumulative = [1]
for i in group_nos:
    cumulative.append(cumulative[-1] + i + 1)
del i
# (template, first group, one past the last group) for each alternative
cumulative = tuple(zip(substitutions, cumulative, cumulative[1:]))
def _sub_func(match):
    for sub, start, stop in cumulative:
        if match.group(start) is not None:
            return sub.format(*map(match.group, range(start, stop)))
def sub_func(text):
    return re.sub(regex, _sub_func, text)
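But this breaks down if the texts you need to substitute overlap: the merged regex makes a single pass over the string, so text produced by one substitution is never rescanned by the others.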
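To make that caveat concrete, here is a small sketch comparing sequential substitution with a merged single pass. The two toy patterns and the function names sequential and single_pass are invented for the illustration.

```python
import re

def sequential(text):
    # Each substitution sees the output of the previous one.
    text = re.sub("a", "b", text)
    text = re.sub("b", "c", text)
    return text

# The same two rules merged into one alternation.
merged = re.compile("(a)|(b)")

def single_pass(text):
    # Replaced text is not rescanned, so an "a" turned into "b" stays "b".
    return merged.sub(lambda m: "b" if m.group(1) is not None else "c", text)

print(sequential("ab"))   # cc
print(single_pass("ab"))  # bc
```

If the original chain of re.sub calls relies on one replacement feeding into a later regex, the merged version will not give the same output.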
Upvotes: 3