DhruvPathak
DhruvPathak

Reputation: 43245

Why is regex search slower with capturing groups in Python?

I have an application code which generates regexes dynamically from a config for some parsing. When timing performance of two variations, the regex variation with each part of an OR regex being captured is noticably slow than a normal regex. The reason would be overhead of certain operations internally in regex module.

>>> import timeit
>>> setup = '''
... import re
... '''   

#no capture group 
>>> print(timeit.timeit("re.search(r'hello|bye|ola|cheers','some say hello,some say bye, or ola or cheers!')", setup=setup))
0.922958850861

#with capture group
>>> print(timeit.timeit("re.search(r'(hello)|(bye)|(ola)|(cheers)','some say hello,some say bye, or ola or cheers!')", setup=setup))
1.44321084023

#no capture group
>>> print(timeit.timeit("re.search(r'hello|bye|ola|cheers','some say hello,some say bye, or ola or cheers!')", setup=setup))
0.913202047348

# capture group
>>> print(timeit.timeit("re.search(r'(hello)|(bye)|(ola)|(cheers)','some say hello,some say bye, or ola or cheers!')", setup=setup))
1.41544604301

Question: What causes this considerable drop in performance when using capture groups ?

Upvotes: 21

Views: 3015

Answers (2)

Federico Piazza
Federico Piazza

Reputation: 31005

The reason is pretty simple, using capturing groups indicate the Engine to save the content in memory, while using non capturing group indicates the engine to not save anything. Consider that you are telling the engine to perform more operations.

For instance, using this regex (hello|bye|ola|cheers) or (hello)|(bye)|(ola)|(cheers) will impact considerably higher than using an atomic group or a non capturing one like (?:hello|bye|ola|cheers).

When using regex you know if you want to capture or not capture content like the case above. If you want to capture any of those words, you will lose performance but if you don't need to capture content then you can save performance by improving it like using non-capturing groups

I know you tagged python, but have have prepared an online benchmark for javascript to show how capturing and non-capturing groups impacts in the js regex engine.

https://jsperf.com/capturing-groups-vs-non-capturing-groups

enter image description here

Upvotes: 22

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626929

Your patterns only differ in the capturing groups. When you define a capturing group in the regex pattern and use the pattern with re.search, the result will be a MatchObject instance. Each match object will contain as many groups as there are capturing groups in the pattern, even if they are empty. That is the overhead for the re internals: adding the (list of) groups (memory allocation, etc.). Mind that groups also contain such details as the starting and ending index of the text that they match and more (refer to the MatchObject reference).

Upvotes: 12

Related Questions