Regular Expression String Mangling Efficiency in Python - Explanation for Slowness?

Question

I'm hoping someone can help explain why Python's re module seems to be so slow at chopping up a very large string for me.

I have string ("content") that is very nearly 600k bytes in size. I'm trying to hack off just the beginning part of it, a variable number of lines, delimited by the text ">>>FOOBAR<<<".

The literal completion time is provided for comparison purposes - the script that this snippet is in takes a bit to run naturally.

The first and worst method:

import re
content = "Massive string that is 600k and contains >>>FOOBAR<<< about 200 lines in"
content = re.sub(".*>>>FOOBAR<<<", ">>>FOOBAR<<<", content, flags=re.S)

Has a completion time of:

real    6m7.213s

While a wordy method:

content = "Massive string that is 600k and contains >>>FOOBAR<<< about 200 lines in"
newstir = ""
flag = False
for l in content.split('
'):
    if re.search(">>>FOOBAR<<<", l):
        flag = True
    #End if we encountered our flag line
    if flag:
        newstir += l
#End loop through content
content = newstir

Has an expected completion time of:

real    1m5.898s

And using a string's .split method:

content = "Massive string that is 600k and contains >>>FOOBAR<<< about 200 lines in"
content = content.split(">>>FOOBAR<<<")[1]

Also has an expected completion time of:

real    1m6.427s

What's going on here? Why is my re.sub call so ungodly slow for the same string?

Casimir et Hippolyte · Accepted Answer

There is no good way to do it with a pattern starting either with .* or .*? in particular with large data, since the first will cause a lot of backtracking and the second must test for each taken character if the following subpattern fails (until it succeeds). Using a non-greedy quantifier isn't faster than using a greedy quantifier.

I suspect that your ~600k content data are in a file at the beginning. Instead of loading the whole file and storing its content to a variable, work line by line. In this way you will preserve memory and avoid to split and to create a list of lines. Second thing, if you are looking for a literal string, don't use a regex method, use a simple string method like find that is faster:

with open('yourfile') as fh:
    for line in fh:
        result += line
        if line.find('>>>FOOBAR<<<') > -1:
            break

If >>>FOOBAR<<< isn't a simple literal string but a regex pattern, in this case compile the pattern before:

pat = re.compile(r'>>>[A-Z]+<<<')

with open('yourfile') as fh:
    for line in fh:
        result += line
        if pat.search(line):
            break

Regular Expression String Mangling Efficiency in Python - Explanation for Slowness?

Answers (1)

Related Questions