Reputation: 410
I have researched a lot but did not find anything that really helped me. Maybe my approach is just weird - maybe someone can point my thoughts in the right direction.
So here is the situation:
I need to process large amounts of text (hundreds of thousands of documents). In those texts I need to find and process certain strings.
As becomes clear below, this results in a huge number of iterations, because every text needs to be fed into a function that runs it through hundreds of thousands of regexes, which in the end leads to really long runtimes.
Is there a better and faster way to accomplish this task? The way it is done now works, but it is painfully slow and puts heavy load on the server for weeks.
Some example code to illustrate my thoughts:
import re

cases = []     # 100 000 case numbers from the db
suffixes = []  # 500 different suffixes to try from the db
texts = []     # 100 000 for the beginning - will become less after the initial run

def process_item(text: str) -> str:
    for s in suffixes:
        pattern = '(...)(.*?)(%s|...)' % s
        x = re.findall(pattern, text, re.IGNORECASE)
        for match in x:
            # process the matches, where I need to know which suffix matched
            pass
    for c in cases:
        escaped = re.escape(c)
        x = re.findall(escaped, text, re.IGNORECASE)
        for match in x:
            # process the matches, where I need to know which number matched
            pass
    return text

for text in texts:
    processed = process_item(text)
Every idea is highly appreciated!
Upvotes: 0
Views: 1413
Reputation: 608
I can't comment, so just some thoughts:
From what you've posted it looks like the strings you want to search for are always the same, so why not join them into one big regexp and compile it once, before running the loop?
That way you compile the regular expression just once instead of on every iteration.
e.g.
import re

cases = []     # 100 000 case numbers from the db
suffixes = []  # 500 different suffixes to try from the db
texts = []     # 100 000 for the beginning - will become less after the initial run

bre1 = re.compile('|'.join(suffixes), re.IGNORECASE)
bre2 = re.compile('|'.join(re.escape(c) for c in cases), re.IGNORECASE)

def process_item(text: str) -> str:
    x = bre1.findall(text)
    for match in x:
        # process the matches, where I need to know which suffix matched
        pass
    x = bre2.findall(text)
    for match in x:
        # process the matches, where I need to know which number matched
        pass
    return text

for text in texts:
    processed = process_item(text)
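Note that with a plain alternation, findall already returns the matched text, so you know which suffix hit. If you need the canonical (db) form of the suffix even when the text matched in a different case, you can map the lowercased match back to the original. A minimal sketch with hypothetical sample suffixes (your real ones come from the db); sorting longest-first keeps a longer suffix from losing to a shorter overlapping one:

```python
import re

# Hypothetical sample data standing in for the db-backed list.
suffixes = ["GmbH", "AG", "Ltd"]

# One alternation, longest alternative first; re.escape guards against
# regex metacharacters inside a suffix.
big = re.compile(
    '|'.join(re.escape(s) for s in sorted(suffixes, key=len, reverse=True)),
    re.IGNORECASE,
)

# Map the lowercased match back to the canonical suffix from the db.
canonical = {s.lower(): s for s in suffixes}

def find_suffixes(text):
    return [canonical[m.group().lower()] for m in big.finditer(text)]
```

For example, `find_suffixes("Acme gmbh and Foo LTD")` returns `["GmbH", "Ltd"]` even though the casing in the text differs from the db values.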
If you could reliably find the case number in the text (e.g. if it has some identifier before it), it would be better to find the case number using re.search, keep the case numbers in a set, and test for membership in that set.
e.g.
cases = ["123", "234"]
cases_set = set(cases)
texts = ["id:123", "id:548"]
sre = re.compile(r'(?<=id:)\d{3}')

for t in texts:
    m = sre.search(t)
    if m and m.group() in cases_set:
        # do stuff ....
        pass
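The same idea scales to many candidates per text: scan with one generic pattern and filter by set membership, so each text costs one regex pass plus O(1) lookups instead of 100 000 separate regexes. A small sketch, assuming the case numbers all follow an "id:" prefix (adjust the pattern to your real format):

```python
import re

# Hypothetical sample data; the real case numbers come from the db.
cases_set = {"123", "234", "548"}

# One generic pattern that matches anything shaped like a case number
# (assumed here: "id:" followed by digits), then filter via the set.
sre = re.compile(r'(?<=id:)\d+')

def matched_cases(text):
    return [m for m in sre.findall(text) if m in cases_set]
```

For example, `matched_cases("id:123 foo id:999 id:548")` returns `["123", "548"]`; the unknown number 999 is discarded by the set lookup rather than by another regex.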
Upvotes: 3