Extracting matches with the original case used in the pattern during a case insensitive search

Question

A regex match returns the text that was matched. What if I instead want the part of the pattern, as written in the pattern, that produced the match?

See the below example:

>>> import re
>>> r = re.compile('ERP|Gap', re.I)
>>> string = 'ERP is integral part of GAP, so erp can never be ignored, ErP!'
>>> r.findall(string)
['ERP', 'GAP', 'erp', 'ErP']

but I want the output to look like this : ['ERP', 'Gap', 'ERP', 'ERP']

Because if I do a group by and sum on the original output, I would get the following output as a dataframe:

ERP 1
erp 1
ErP 1
GAP 1
gap 1

But what if I want the output to look like

ERP 3
Gap 2

in line with the keywords I am searching for?

MORE CONTEXT

I have a keyword list like this: ['ERP', 'Gap']. I have a string like this: "ERP, erp, ErP, GAP, gap"

I want to count how many times each keyword appears in the string. With plain pattern matching, I get the following output: ['ERP', 'erp', 'ErP', 'GAP', 'gap'].

If I then aggregate and count, I get the following dataframe:

ERP 1
erp 1
ErP 1
GAP 1
gap 1

While I want the output to look like this:

ERP 3
Gap 2

Wiktor Stribiżew · Accepted Answer

You may build the pattern dynamically to include indices of the words you search for in the group names and then grab those pattern parts that matched:

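import re

words = ["ERP", "Gap"]
# Map each group name (g0, g1, ...) back to the keyword as originally written
words_dict = {f'g{i}': item for i, item in enumerate(words)}

# One named group per keyword; re.escape guards keywords that contain regex metacharacters
rx = rf"\b(?:{'|'.join([rf'(?P<g{i}>{re.escape(item)})' for i, item in enumerate(words)])})\b"

text = 'ERP is integral part of GAP, so erp can never be ignored, ErP!'

results = []
for match in re.finditer(rx, text, flags=re.IGNORECASE):
    # Only the group that took part in the match has a non-empty value
    results.append([words_dict.get(key) for key, value in match.groupdict().items() if value][0])

print(results) # => ['ERP', 'Gap', 'ERP', 'ERP']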

The pattern will look like \b(?:(?P<g0>ERP)|(?P<g1>Gap))\b:

\b - a word boundary
(?: - start of a non-capturing group encapsulating pattern parts:
- (?P<g0>ERP) - Group "g0": ERP
- | - or
- (?P<g1>Gap) - Group "g1": Gap
) - end of the group
\b - a word boundary.
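
To check the breakdown above, here is a minimal sketch that builds the pattern for the two keywords and prints which named group is filled for a single match. The group names g0 and g1 follow the answer's scheme; the re.escape call and the shortened sample sentence are assumptions added for illustration.

```python
import re

words = ["ERP", "Gap"]

# Build one named group per keyword: g0 for "ERP", g1 for "Gap"
pattern = r"\b(?:" + "|".join(
    f"(?P<g{i}>{re.escape(w)})" for i, w in enumerate(words)
) + r")\b"
print(pattern)  # \b(?:(?P<g0>ERP)|(?P<g1>Gap))\b

# Only the group whose alternative matched holds text; the other is None
m = re.search(pattern, "so erp can never be ignored", flags=re.IGNORECASE)
print(m.groupdict())  # {'g0': 'erp', 'g1': None}
```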

Note that the [0] index in [words_dict.get(key) for key,value in match.groupdict().items() if value][0] is always safe: whenever there is a match, exactly one of the named groups took part in it, so the list has exactly one element.
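
Since exactly one named group takes part in each match, Match.lastgroup (the name of the last matched capturing group) may be used as a shorter lookup key. The sketch below is a variation on the answer, not the answer's own code: it combines lastgroup with collections.Counter to get the per-keyword counts the question asks for, using the string from the "MORE CONTEXT" part. It assumes the keywords contain no capturing groups of their own, which re.escape ensures here.

```python
import re
from collections import Counter

words = ["ERP", "Gap"]
# Group name -> keyword as originally written
words_dict = {f"g{i}": w for i, w in enumerate(words)}

rx = r"\b(?:" + "|".join(
    f"(?P<g{i}>{re.escape(w)})" for i, w in enumerate(words)
) + r")\b"

text = "ERP, erp, ErP, GAP, gap"

# m.lastgroup is the name of the single group that matched, e.g. "g0"
counts = Counter(
    words_dict[m.lastgroup] for m in re.finditer(rx, text, flags=re.IGNORECASE)
)
print(counts)  # Counter({'ERP': 3, 'Gap': 2})
```

If a dataframe is needed, the counts can be passed on from counts.items().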
