errorinpersona

Reputation: 410

Python regex performance: Best way to iterate over texts with thousands of regex

I researched a lot but did not find anything that really helped me. Maybe my approach is just odd; hopefully someone can point my thoughts in the right direction.

So here is the situation:

I need to process large amounts of text (hundreds of thousands of documents). In those texts I need to find and process certain strings:

  1. Certain “static” substrings (like case numbers) that I pull from my database (also hundreds of thousands of them)
  2. Strings that I match with a regex built dynamically to catch every possible occurrence, where the last part of the regex is set dynamically

As you can see, this results in an enormous number of iterations, because every text has to be fed into a function that runs it through hundreds of thousands of regexes, which ultimately leads to very long runtimes.

Is there a better and faster way to accomplish this task? The way it is done now works, but it is painfully slow and puts heavy load on the server for weeks.

Some example code to illustrate my thoughts:

import re

cases = []          # 100 000 case numbers from db
suffixes = []       #  500 different suffixes to try from db

texts = []          # 100 000 for the beginning - will become less after initial run

def process_item(text: str) -> str:
    for s in suffixes:
        pattern = '(...)(.*?)(%s|...)' % s
        x = re.findall(pattern, text, re.IGNORECASE)
        for match in x:
            # process the matches, where I need to know which suffix matched
            pass
    for c in cases:
        escaped = re.escape(c)
        x = re.findall(escaped, text, re.IGNORECASE)
        for match in x:
            # process the matches, where I need to know which number matched
            pass

    return text


for text in texts:
    processed = process_item(text)

Every idea is highly appreciated!

Upvotes: 0

Views: 1413

Answers (1)

Marek Schwarz

Reputation: 608

I can't comment, but here are some thoughts:

From what you've posted it looks like the things you want to search for are always the same, so why not join them into one big regex and compile that big regex once, before running the loop.

This way the regular expression is not rebuilt on every iteration, but compiled just once.

e.g.

import re

cases = []          # 100 000 case numbers from db
suffixes = []       #  500 different suffixes to try from db

texts = []          # 100 000 for the beginning - will become less after initial run

# compile once, outside the loop: one alternation for the suffixes,
# one for the (escaped) case numbers
bre1 = re.compile('|'.join(suffixes), re.IGNORECASE)
bre2 = re.compile('|'.join(re.escape(c) for c in cases), re.IGNORECASE)

def process_item(text: str) -> str:
    for match in bre1.findall(text):
        # process the matches, where I need to know which suffix matched
        pass

    for match in bre2.findall(text):
        # process the matches, where I need to know which number matched
        pass

    return text


for text in texts:
    processed = process_item(text)
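
The question also needs to know which suffix produced each match. With a single combined pattern you can keep that information by giving every alternative its own named group and reading lastgroup on each match. A minimal sketch with made-up sample values, assuming the suffixes are plain literals (hence the re.escape; drop it if they are already regex fragments):

import re

suffixes = ["GmbH", "AG", "KG"]   # hypothetical sample values

# one named group per suffix (s0, s1, ...) so that match.lastgroup
# tells us which alternative actually matched
suffix_by_group = {'s%d' % i: s for i, s in enumerate(suffixes)}
bre1 = re.compile(
    '|'.join('(?P<%s>%s)' % (name, re.escape(s))
             for name, s in suffix_by_group.items()),
    re.IGNORECASE)

text = "Example text mentioning Foo GmbH and Bar AG."
for m in bre1.finditer(text):
    matched_suffix = suffix_by_group[m.lastgroup]
    # process m.group() knowing which suffix it came from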

If you can reliably find the case number in the text (e.g. if it has some identifier before it), it would be better to extract the candidate with re.search, keep the case numbers in a set, and test for membership in that set.

e.g.

cases = ["123", "234"]
cases_set = set(cases)

texts = ["id:123", "id:548"]

sre = re.compile(r'(?<=id:)\d{3}')
for t in texts:
    m = re.search(sre, t)
    if m and m.group() in cases_set:
        # do stuff ....
        pass
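
Since every occurrence in a text matters rather than just the first, the same idea carries over to finditer; a minimal sketch reusing the hypothetical id: prefix from the example above:

import re

cases_set = {"123", "234"}
texts = ["id:123 foo id:999 bar id:234", "id:548"]

sre = re.compile(r'(?<=id:)\d{3}')
for t in texts:
    for m in sre.finditer(t):   # every candidate in the text, not just the first
        if m.group() in cases_set:
            # do stuff with the matched case number ...
            pass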

Upvotes: 3
