Combine words from list to single regex with word boundary

Question

I have a list l = ["AA", "CC", "DD", "EE"]

I also have many strings from a file, and I want to find the strings that contain any of the exact words from the list. I do not need to know which word matched in a particular string. Reading other SO questions, I see suggestions to combine the list into a single regex, mainly in the following two ways:

1. \bAA\b|\bCC\b|\bDD\b|\bEE\b     ==> r"\b%s\b" % r"\b|\b".join(l)
2. \b(?:AA|CC|DD|EE)\b             ==> r"\b(?:%s)\b" % "|".join(l)

The joins shown on the right are just an example and are not part of the question.

When I run the code, both give the same correct answer, and timeit gives similar timings. If I do not care which word from the list matched, is grouping necessary as in option #2? Why are the word boundaries at the ends in option #2? Does that mean they apply to all the words inside the parentheses, i.e. that it is equivalent to (?:\bAA\b|\bCC\b|\bDD\b|\bEE\b)? Can anyone point to a link that mentions this property of parentheses? Is either of the two options more correct/pythonic?

Tim Biegeleisen · Accepted Answer

The two versions are logically identical, should produce identical results, and should also have similar performance. The version you should actually use is the second one:

\b(?:AA|CC|DD|EE)\b

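The reason is that it is more terse and avoids unnecessarily repeating the word boundary for each term in the alternation. This regex says to match any one of the terms in the alternation, with a word boundary on both ends, so it is indeed equivalent to (?:\bAA\b|\bCC\b|\bDD\b|\bEE\b). Regarding the "group," the ?: inside the parentheses turns off capturing, so the parentheses only group the alternation and nothing is stored for the match. The parentheses themselves are still required: alternation has the lowest precedence of any regex operator, so without them \bAA|CC|DD|EE\b would apply the leading boundary only to AA and the trailing boundary only to EE. Grouping is what lets you avoid repeating the word boundaries for each term, which is what the first version does.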
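As a rough illustration of the points above, here is a minimal sketch that builds both versions from the list and compares them against a pattern with the parentheses left out. The sample strings, the helper name matching, and the use of re.escape are my own assumptions for the demo and are not taken from the question.

```python
import re

words = ["AA", "CC", "DD", "EE"]

# re.escape is an added precaution (an assumption, not in the question);
# it changes nothing for plain alphanumeric words like these.
escaped = [re.escape(w) for w in words]

# Option 1: word boundaries repeated around every term.
repeated = re.compile(r"\b%s\b" % r"\b|\b".join(escaped))

# Option 2: one pair of boundaries around a non-capturing group.
grouped = re.compile(r"\b(?:%s)\b" % "|".join(escaped))

# No parentheses: \b attaches only to the first and last alternatives.
ungrouped = re.compile(r"\b%s\b" % "|".join(escaped))

# Hypothetical sample data standing in for the lines of the file.
lines = ["xx AA yy", "AAB CCD", "see DD.", "nothing here", "xCCx"]


def matching(pattern, strings):
    """Return the strings in which the pattern matches somewhere."""
    return [s for s in strings if pattern.search(s)]


print(matching(repeated, lines))   # ['xx AA yy', 'see DD.']
print(matching(grouped, lines))    # ['xx AA yy', 'see DD.']
print(matching(ungrouped, lines))  # ['xx AA yy', 'AAB CCD', 'see DD.', 'xCCx']
print(grouped.groups)              # 0, since (?:...) does not capture
```

The first two patterns select the same strings, while the ungrouped one also accepts "AAB CCD" and "xCCx", because only AA keeps its leading boundary and only EE keeps its trailing one.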
