Bear

Reputation: 5152

Speed up a series of regex replacement in python

My Python script reads each line in a file and performs many regex replacements on each line.

If a regex succeeds, it skips to the next line.

Is there any way to speed up this kind of script?
Is it worth calling subn instead, checking whether a replacement was done, and then skipping the remaining patterns?
If I compile the regexes, is it possible to store all the compiled regexes in memory?

for file in files:
    for line in file:
        re.sub() # <--- ~ 100 re.sub

PS: the replacement varies for each regex
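To sketch what the question describes: you can compile all the regexes once, keep them in a list in memory, and use `subn` (which returns the new string and the number of substitutions) to skip the remaining patterns as soon as one fires. The patterns and replacements below are hypothetical stand-ins for the ~100 real ones.

```python
import re

# Hypothetical (pattern, replacement) rules standing in for the real ~100.
# Compiling once up front keeps the compiled objects in memory for reuse.
rules = [
    (re.compile(r"\bcolour\b"), "color"),
    (re.compile(r"\bcentre\b"), "center"),
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "<date>"),
]

def process_line(line):
    for pattern, replacement in rules:
        new_line, count = pattern.subn(replacement, line)
        if count:  # a replacement happened: skip the remaining patterns
            return new_line
    return line  # no rule matched; keep the line unchanged

print(process_line("the colour of the sky"))  # -> 'the color of the sky'
```

Whether the early exit actually helps depends on how often the first few patterns match; if most lines fall through all 100 patterns, combining them (as the answers below suggest) is the bigger win.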

Upvotes: 0

Views: 682

Answers (2)

MRAB

Reputation: 20644

As @Tim Pietzcker said, you could reduce the number of regexes by making them alternatives. You can determine which alternative matched by using the 'lastindex' attribute of the match object.

Here's an example of what you could do:

>>> import re
>>> replacements = {1: "<UPPERCASE LETTERS>", 2: "<lowercase letters>", 3: "<Digits>"}
>>> def replace(m):
...     return replacements[m.lastindex]
...
>>> re.sub(r"([A-Z]+)|([a-z]+)|([0-9]+)", replace, "ABC def 789")
'<UPPERCASE LETTERS> <lowercase letters> <Digits>'
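The same idea works with named groups, which can be easier to maintain than numeric indices when there are many alternatives: `Match.lastgroup` gives the *name* of the group that matched. A hedged variant of the example above:

```python
import re

# Replacements keyed by group name instead of group number.
replacements = {
    "upper": "<UPPERCASE LETTERS>",
    "lower": "<lowercase letters>",
    "digits": "<Digits>",
}

# One combined pattern; each alternative is a named group.
pattern = re.compile(r"(?P<upper>[A-Z]+)|(?P<lower>[a-z]+)|(?P<digits>[0-9]+)")

def replace(m):
    # lastgroup is the name of the last (here, only) group that matched.
    return replacements[m.lastgroup]

print(pattern.sub(replace, "ABC def 789"))
# '<UPPERCASE LETTERS> <lowercase letters> <Digits>'
```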

Upvotes: 2

Tim Pietzcker

Reputation: 336078

You should probably do three things:

  1. Reduce the number of regexes. Depending on differences in the substitution part, you might be able to combine them all into a single one. Using careful alternation, you can determine the sequence in which parts of the regex will be matched.
  2. If possible (depending on file size), read the file into memory completely.
  3. Compile your regexes (mainly for readability; it won't matter in terms of speed as long as the number of regexes stays below 100, since the `re` module caches that many recently used compiled patterns).

This gives you something like:

regex = re.compile(r"My big honking regex")
for datafile in files:
    content = datafile.read()
    result = regex.sub("Replacement", content)

Upvotes: 2
