Why the performance of two regex using capturing groups varies so differently?

Question

I have two regex using the same capturing group, but their performance varied very differently.

My programming language is python3.8.

Regex Description

The two regex are

new\s+String\s*(?P$((?:[^()]++|(?&aux1))*)$)

and

([\w_\.]+(?P$(?:[^()]++|(?&aux1))*$)*+)\s*\.\s*((?:remove|contains|retain)(?:All)?)\s*(?&aux1)

The named capturing group (?P$(?:[^()]++|(?&aux1))*$) matches paired ( and ), and the content between them. For example ( abc() ).

The first regex matches string like new String("This is a string.").

The second one matches string like a.contains(b) or a.getCollection().contains( anotherCollection )

How to reproduce

I want to know the performance of capturing groups against large input, so I download an open-source Github project elasticsearch, which contains 13724 Java files.

I wrote a script to run the two regex against those files line by line. However, I found the first regex only used 1.6286 seconds, while the second regex used 50.6662 seconds.

I don't know what makes the second regex so slow, when the two regex are so similar and shares the same capturing group.

Wiktor Stribiżew · Accepted Answer

The problem is not related to the nested parentheses pattern group.

The [\w_\.]+ part that is located at the very pattern beginning makes all the difference. It is quantified with the + quantifier that can match any one or more occurrences of a wide range of chars, letters, digits, dots, and then there are a couple more patterns to match. When the subsequent patterns do not match, backtracking triggers and can make its way back to the initial pattern, and there may be a lot of positions to test the regex against. It will lead to catastrophic backtracking.

Solution

Always try to "anchor" your regex patterns at some constant pattern. In this case, if you plan to match at word boundaries, put the word boundary pattern at the start and use \b\w[\w.]* instead of [\w_\.]+:

(\b\w[\w.]*(?P$(?:[^()]++|(?&aux1))*$)*+)\s*\.\s*((?:remove|contains|retain)(?:All)?)\s*(?&aux1)

The optimization algorithm will make sure only word boundary positions will get tested, and the first char must be a word char.

Why the performance of two regex using capturing groups varies so differently?

Regex Description

How to reproduce

Answers (1)

Related Questions