fredoverflow

Reputation: 263350

Are non-capturing groups redundant?

Are optional non-capturing groups redundant?

Is the following regex:

(?:wo)?men

semantically equivalent to the following regex?

(wo)?men

Upvotes: 9

Views: 2001

Answers (2)

Grismar

Reputation: 31389

A question elsewhere was asking the same and I provided an answer with an example in Python:

It doesn't "have the same effect" - in one case the group is captured and accessible, in the other it is only used to complete the match.
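As a minimal sketch of that difference, using the two patterns from the question (the sample text and variable names are made up for illustration): both patterns match the same substrings, but only the capturing one exposes the optional prefix as a group.

```python
import re

capturing = re.compile(r"(wo)?men")
noncapturing = re.compile(r"(?:wo)?men")

# Sample input chosen for illustration only.
text = "women and men"

# Both patterns match exactly the same substrings...
cap_matches = [m.group(0) for m in capturing.finditer(text)]
noncap_matches = [m.group(0) for m in noncapturing.finditer(text)]

# ...but only the capturing pattern records the optional prefix as group 1.
cap_groups = [m.groups() for m in capturing.finditer(text)]
noncap_groups = [m.groups() for m in noncapturing.finditer(text)]

print(cap_matches)     # ['women', 'men']
print(noncap_matches)  # ['women', 'men']
print(cap_groups)      # [('wo',), (None,)]
print(noncap_groups)   # [(), ()]
```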

People use non-capturing groups when they are not interested in accessing the value of the group: it saves space in situations with many matches, and it can also perform better where the regex engine is optimised for it.

A useless example in Python to illustrate the point:

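from timeit import timeit
import re

# Build a 100,000-character string that repeats 'abcdefghij'.
chars = 'abcdefghij'
s = ''.join(chars[i % len(chars)] for i in range(100000))


def capturing():
    # Ten nested capturing groups: every match stores ten substrings.
    re.findall('(a(b(c(d(e(f(g(h(i(j))))))))))', s)


def noncapturing():
    # Same nesting, but only the innermost group (j) is captured.
    re.findall('(?:a(?:b(?:c(?:d(?:e(?:f(?:g(?:h(?:i(j))))))))))', s)


# Time 1000 runs of each variant.
print(timeit(capturing, number=1000))
print(timeit(noncapturing, number=1000))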

Output:

5.8383678999998665
1.0528525999998237

Note: this is in spite of PyCharm (if you happen to use it) warning "Unnecessary non-capturing group". The warning is correct, but clearly not the whole truth: the group is logically unneeded, but it definitely does not have the same practical effect.

If the reason you wanted to get rid of them was to suppress such warnings, PyCharm allows you to do so with this:

# noinspection RegExpUnnecessaryNonCapturingGroup
re.findall('(?:a(?:b(?:c(?:d(?:e(?:f(?:g(?:h(?:i(j))))))))))', s)

Another note for the pedantic: the examples above aren't strictly logically equivalent either. But they match the same strings, just with different results.

c = re.findall('(a(b(c(d(e(f(g(h(i(j))))))))))', s)
nc = re.findall('(?:a(?:b(?:c(?:d(?:e(?:f(?:g(?:h(?:i(j))))))))))', s)

c is a list of 10-tuples ([('abcdefghij', 'bcdefghij', ..), ..]), while nc is a list of single strings (['j', ..]).

Upvotes: 0

Wiktor Stribiżew

Reputation: 627409

Your (?:wo)?men and (wo)?men are semantically equivalent but technically different: the first uses a non-capturing group and the second a capturing one. So the question is, why use non-capturing groups when we have capturing ones?

Non-capturing groups are of help sometimes:

  1. To avoid an excessive number of backreferences (remember that it is sometimes difficult to use backreferences higher than 9)
  2. To avoid the problem with the limit of 99 numbered backreferences (by reducing the number of numbered capturing groups) (source: Regular-expressions.info: Most regex flavors support up to 99 capturing groups and double-digit backreferences.)
    NOTE: this does not pertain to the Java regex engine, nor to the PHP or .NET regex engines.
  3. To lessen the overhead caused by storing the captures in the stack
  4. To add more groupings to an existing regex without ruining the order of the capturing groups.
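Point 4 can be sketched in Python as follows. The date pattern, the optional "ISO " prefix, and the variable names are hypothetical and only serve to show how group numbers shift (or stay put) when a group is added.

```python
import re

# Hypothetical pattern: group 1 = year, group 2 = month.
original = re.compile(r"(\d{4})-(\d{2})")

# Adding a capturing group for an optional prefix shifts the numbering...
shifted = re.compile(r"(ISO )?(\d{4})-(\d{2})")

# ...while a non-capturing group keeps year and month at 1 and 2.
stable = re.compile(r"(?:ISO )?(\d{4})-(\d{2})")

text = "ISO 2024-05"
m_shifted = shifted.search(text)
m_stable = stable.search(text)

print(original.groups, shifted.groups, stable.groups)  # 2 3 2
print(m_shifted.group(1))  # 'ISO ' (the year has moved to group 2)
print(m_stable.group(1))   # '2024' (same index as in the original pattern)
```

Code that reads group(1) as the year keeps working with the non-capturing variant, but would need to be updated for the capturing one.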

Also, it just makes our matches cleaner:

You can use a non-capturing group to retain the organisational or grouping benefits but without the overhead of capturing.
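A small Python sketch of that grouping benefit, assuming an alternation that needs parentheses only for structure (the sample text is made up): with re.findall, a capturing group changes what is returned, while a non-capturing group leaves the whole match intact.

```python
import re

# Sample input chosen for illustration only.
text = "cats and dogs"

# Capturing group: findall returns only what the group captured.
captured = re.findall(r"(cat|dog)s", text)

# Non-capturing group: same grouping of the alternation, but findall
# returns the whole match because the pattern has no capturing groups.
whole = re.findall(r"(?:cat|dog)s", text)

print(captured)  # ['cat', 'dog']
print(whole)     # ['cats', 'dogs']
```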

It does not seem a good idea to re-factor existing regular expressions to convert capturing to non-capturing groups, since it may ruin the code or require too much effort.

Upvotes: 12
