Reputation: 744

Doing multiple, successive regex replacements in Python. Inefficient?

First off - my code works. It just runs slowly, and I'm wondering if I'm missing something that would make it more efficient. I'm parsing PDFs with Python (and yes, I know that this should be avoided if at all possible).

My problem is that I have to do several rather complex regex substitutions - and when I say substitution, I really mean deletion. I run the ones that strip out the most data first, so that the later expressions don't need to analyze as much text, but that's all I can think of to speed things up.

I'm pretty new to Python and regexes, so it's very conceivable this could be done better.

Thanks for reading.

    regexPagePattern = r"(Wk)\d{1,2}.\d{2}(\d\.\d{1,2})"
    regexCleanPattern = r"(\(continued\))?((II)\d\.\d{1,2}|\d\.\d{1,2}(II)|\d\.\d{1,2})"
    regexStartPattern = r".*(II)(\s)?(INDEX OF CHARTS AFFECTED)"
    regexEndPattern = r"(II.)\d{1,5}\((P|T)\).*"
    contentRaw = re.sub(regexStartPattern,"",contentRaw)
    contentRaw = re.sub(regexEndPattern,"",contentRaw)
    contentRaw = re.sub(regexPagePattern,"",contentRaw)
    contentRaw = re.sub(regexCleanPattern,"",contentRaw)

Upvotes: 4

Answers (2)

hochl

Reputation: 12920

I'm not sure whether you do this inside a loop. If not, the following does not apply.

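If you use a pattern multiple times, you should compile it using re.compile( ... ). This way the pattern is only compiled once. The speed increase can be noticeable, though the re module also caches recently compiled patterns internally, so it is worth measuring before expecting a large gain. Minimal example: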

    >>> import re
    >>> a = "a b c d e f"
    >>> re.sub(' ', '-', a)
    'a-b-c-d-e-f'
    >>> p = re.compile(' ')
    >>> p.sub('-', a)
    'a-b-c-d-e-f'

Another idea: use re.split( ... ) instead of re.sub and operate on the list of resulting fragments. I'm not entirely sure how it is implemented, but I think re.sub creates text fragments and merges them into one string at the end, which is expensive. After the last step you can join the list using " ".join(fragments). Obviously, this method will not work if your patterns overlap somewhere.
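
A minimal sketch of the split-and-join idea, assuming a pattern without capturing groups (with groups, re.split also returns the captured text). The sample text, the pattern, and the helper name delete_by_split are made up for illustration and are not taken from the question:

```python
import re

# Hypothetical pattern: a number such as 1.2 or 3.45, with no capturing groups.
NUMBER = re.compile(r"\d\.\d{1,2}")

def delete_by_split(pattern, text):
    """Remove every match of `pattern` by splitting on it and rejoining."""
    fragments = pattern.split(text)  # the matched text is dropped by split()
    return "".join(fragments)        # "" reproduces a pure deletion

text = "alpha 1.2 beta 3.45 gamma"
print(repr(delete_by_split(NUMBER, text)))
```

Joining with an empty string gives the same result as NUMBER.sub("", text); joining with " " would add a space at every cut. Whether this is actually faster than re.sub is something I have not measured.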

It would be interesting to get timing information for your program before and after your changes.
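
One way to get such timings is the standard timeit module. This is only a sketch: the sample string is a made-up stand-in for the extracted PDF text, and the repeat counts are arbitrary, so the numbers will differ on real data.

```python
import re
import timeit

# Hypothetical stand-in for the text extracted from the PDF.
sample = "Wk12.345.67 some chart text II3.14 (continued)2.5 " * 200

PAGE = r"(Wk)\d{1,2}.\d{2}(\d\.\d{1,2})"  # regexPagePattern from the question
page_compiled = re.compile(PAGE)

def with_string_pattern():
    # Passes the pattern string on every call (re looks it up in its cache).
    return re.sub(PAGE, "", sample)

def with_compiled_pattern():
    # Reuses the pattern object that was compiled once above.
    return page_compiled.sub("", sample)

t_string = timeit.timeit(with_string_pattern, number=200)
t_compiled = timeit.timeit(with_compiled_pattern, number=200)
print(f"string pattern:   {t_string:.4f}s")
print(f"compiled pattern: {t_compiled:.4f}s")
```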

Upvotes: 4

marue

Reputation: 5726

Regexes should be the last choice when trying to decode strings. So if you see another way to solve your problem, use that.

That said, you could use re.compile to precompile your regex patterns:

    import re

    # Compile once, then reuse the pattern object for every substitution.
    regexPagePattern = re.compile(r"(Wk)\d{1,2}.\d{2}(\d\.\d{1,2})")
    contentRaw = regexPagePattern.sub("", contentRaw)

That should speed things up a bit (a pretty nice bit ;) )
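
Applied to all four patterns from the question, it could look roughly like this. The names PATTERNS and clean are my own, and the order simply follows the order of the re.sub calls in the question:

```python
import re

# The four patterns from the question, compiled once when the module loads.
PATTERNS = [
    re.compile(r".*(II)(\s)?(INDEX OF CHARTS AFFECTED)"),  # regexStartPattern
    re.compile(r"(II.)\d{1,5}\((P|T)\).*"),                # regexEndPattern
    re.compile(r"(Wk)\d{1,2}.\d{2}(\d\.\d{1,2})"),         # regexPagePattern
    re.compile(r"(\(continued\))?((II)\d\.\d{1,2}|\d\.\d{1,2}(II)|\d\.\d{1,2})"),  # regexCleanPattern
]

def clean(content_raw):
    """Delete every match of each pattern, in the same order as the question."""
    for pattern in PATTERNS:
        content_raw = pattern.sub("", content_raw)
    return content_raw

print(repr(clean("keep Wk12.345.67 this")))
```

Calling clean() once per page or per document keeps the compilation out of the loop.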

Upvotes: 0
