Reputation: 3134
I'm hoping someone can help explain why Python's re module seems to be so slow at chopping up a very large string for me.
I have string ("content") that is very nearly 600k bytes in size. I'm trying to hack off just the beginning part of it, a variable number of lines, delimited by the text ">>>FOOBAR<<<".
The literal completion time is provided for comparison purposes - the script that this snippet is in takes a bit to run naturally.
The first and worst method:
import re
content = "Massive string that is 600k and contains >>>FOOBAR<<< about 200 lines in"
content = re.sub(".*>>>FOOBAR<<<", ">>>FOOBAR<<<", content, flags=re.S)
Has a completion time of:
real 6m7.213s
While a wordy method:
content = "Massive string that is 600k and contains >>>FOOBAR<<< about 200 lines in"
newstir = ""
flag = False
for l in content.split('\n'):
if re.search(">>>FOOBAR<<<", l):
flag = True
#End if we encountered our flag line
if flag:
newstir += l
#End loop through content
content = newstir
Has an expected completion time of:
real 1m5.898s
And using a string's .split method:
content = "Massive string that is 600k and contains >>>FOOBAR<<< about 200 lines in"
content = content.split(">>>FOOBAR<<<")[1]
Also has an expected completion time of:
real 1m6.427s
What's going on here? Why is my re.sub call so ungodly slow for the same string?
Upvotes: 0
Views: 106
Reputation: 89557
There is no good way to do it with a pattern starting either with .*
or .*?
in particular with large data, since the first will cause a lot of backtracking and the second must test for each taken character if the following subpattern fails (until it succeeds). Using a non-greedy quantifier isn't faster than using a greedy quantifier.
I suspect that your ~600k content data are in a file at the beginning. Instead of loading the whole file and storing its content to a variable, work line by line. In this way you will preserve memory and avoid to split and to create a list of lines. Second thing, if you are looking for a literal string, don't use a regex method, use a simple string method like find
that is faster:
with open('yourfile') as fh:
for line in fh:
result += line
if line.find('>>>FOOBAR<<<') > -1:
break
If >>>FOOBAR<<<
isn't a simple literal string but a regex pattern, in this case compile the pattern before:
pat = re.compile(r'>>>[A-Z]+<<<')
with open('yourfile') as fh:
for line in fh:
result += line
if pat.search(line):
break
Upvotes: 1