LaiKash

Reputation: 1

Python regex catastrophic backtracking in large strings (JavaScript parser)

I am trying to parse a beautified JavaScript file that contains huge functions. I want to capture each function as its own match object so that I can process them individually.

An example could be:

__d(function(e, t, n, r, i, l, a) {

//AnyCharacters

}, 93, [27, 38, 40, 37, 94, 98, 99, 32]);

I am trying the following regex:

(?s)__d\(function\((\w+,\s)+\w+\)\s\{(.*?)\},\s\d+(.*?)\d\]\)\;

For more context, I am trying to write each function to a file after some more processing:

import re

functions_sep_regex = re.compile(r'(?s)__d\(function\((\w+,\s)+\w+\)\s\{(.*?)\},\s\d+(.*?)\d\]\)\;')

functions_sep = functions_sep_regex.finditer(res)  # res: the JavaScript source as a string

for function_match in functions_sep:
    # Do something with function_match.group(0)
    pass

The backtracking problem is in the first (.*?), which I use to capture every character between the start and the end of the function.

Some backtracking is expected, since the pattern has to match any character (including newlines), but here it becomes catastrophic and the engine crashes.

Is there a way to avoid this "crash"?

EDIT:

Reproducible example: pastebin.com/PcdSWnWG

Upvotes: 0

Views: 202

Answers (1)

Jan

Reputation: 43179

The problem is not the (.*?) but the nested quantifiers:

functions_sep_regex = re.compile(r'(?s)__d\(function\((\w+,\s)+\w+\)\s\{(.*?)\},\s\d+(.*?)\d\]\)\;')
#                                                           ^^^

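This group repeats a pattern that already contains a quantifier, so it is likely to explode as the regex engine tries alternative ways to split the input in order to report a match.
Either make it possessive with ++ (supported by Python's re module from 3.11 onwards, and by the third-party regex module) or rephrase this part of your expression.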
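
As a rough illustration of the "rephrase" option, here is a minimal sketch that rewrites only the parameter-list part of the pattern. It assumes, as the original (\w+,\s)+\w+ does, that every function takes at least two parameters. The name FUNCTION_RE and the sample string are made up for this example, and the sketch does not verify whether this change alone is enough for the real input.

```python
import re

# Sketch only: the parameter list is written as "first name, then one or
# more ', name' repetitions". The repeated group is non-capturing and each
# repetition must begin with a literal comma, so there is only one way to
# split the list. The rest of the pattern is unchanged from the question.
FUNCTION_RE = re.compile(
    r'(?s)__d\(function\(\w+(?:,\s\w+)+\)\s\{(.*?)\},\s\d+(.*?)\d\]\)\;'
)

# Hypothetical sample modelled on the example in the question.
sample = (
    "__d(function(e, t, n, r, i, l, a) {\n"
    "\n"
    "//AnyCharacters\n"
    "\n"
    "}, 93, [27, 38, 40, 37, 94, 98, 99, 32]);\n"
)

# Collect the full text of every matched function.
matches = [m.group(0) for m in FUNCTION_RE.finditer(sample)]
```

Note that the group numbers shift: because the parameter group no longer captures, the function body is now group(1) instead of group(2).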

Upvotes: 1
