Sankar

Reputation: 586

Extracting matches with the original case used in the pattern during a case insensitive search

A regex match returns the text that matched. What if I want the part of the pattern (the keyword, in the case I wrote it) that produced the match instead?

See the example below:

>>> import re
>>> r = re.compile('ERP|Gap', re.I)
>>> string = 'ERP is integral part of GAP, so erp can never be ignored, ErP!'
>>> r.findall(string)
['ERP', 'GAP', 'erp', 'ErP']

But I want the output to look like this: ['ERP', 'Gap', 'ERP', 'ERP']

Because if I group by and sum on raw matches like these, every case variant is counted separately, and I get a dataframe like this:

ERP 1
erp 1
ErP 1
GAP 1
gap 1

But I want the output to look like this, on par with the keywords I am searching for:

ERP 3
Gap 2

MORE CONTEXT

I have a keyword list like this: ['ERP', 'Gap']. I have a string like this: "ERP, erp, ErP, GAP, gap"

I want to count the number of times each keyword appears in the string. With case-insensitive pattern matching, I get the following output: ['ERP', 'erp', 'ErP', 'GAP', 'gap'].

If I then aggregate and count, I get the following dataframe:

ERP 1
erp 1
ErP 1
GAP 1
gap 1

While I want the output to look like this:

ERP 3
Gap 2

Upvotes: 1

Views: 423

Answers (2)

Wiktor Stribiżew

Reputation: 626926

You may build the pattern dynamically to include indices of the words you search for in the group names and then grab those pattern parts that matched:

import re

words = ["ERP", "Gap"]
# Map each group name (g0, g1, ...) back to the keyword as originally written
words_dict = { f'g{i}':item for i,item in enumerate(words) }

# Wrap every keyword in its own named group and join them as alternatives
rx = rf"\b(?:{'|'.join([ rf'(?P<g{i}>{item})' for i,item in enumerate(words) ])})\b"

text = 'ERP is integral part of GAP, so erp can never be ignored, ErP!'

results = []
for match in re.finditer(rx, text, flags=re.IGNORECASE):
    # Only the group that took part in the match has a non-empty value
    results.append( [words_dict.get(key) for key,value in match.groupdict().items() if value][0] )

print(results) # => ['ERP', 'Gap', 'ERP', 'ERP']


The pattern will look like \b(?:(?P<g0>ERP)|(?P<g1>Gap))\b:

  • \b - a word boundary
  • (?: - start of a non-capturing group encapsulating pattern parts:
    • (?P<g0>ERP) - Group "g0": ERP
    • | - or
    • (?P<g1>Gap) - Group "g1": Gap
  • ) - end of the group
  • \b - a word boundary.
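
The breakdown above can be checked with a small sketch that builds the pattern in two steps and inspects which named group took part in a match. The variable names `alternatives`, `pattern` and `m` are illustrative, and the `re.escape` call is an addition that assumes the keywords are meant as literal text rather than as regex fragments.

```python
import re

words = ["ERP", "Gap"]  # keyword list from the question

# One named group per keyword: g0 -> ERP, g1 -> Gap.
# re.escape is an added assumption: keywords are treated as literal text.
alternatives = "|".join(f"(?P<g{i}>{re.escape(w)})" for i, w in enumerate(words))
pattern = rf"\b(?:{alternatives})\b"
print(pattern)  # \b(?:(?P<g0>ERP)|(?P<g1>Gap))\b

# With IGNORECASE, "GAP" is matched by the g1 alternative only.
m = re.search(pattern, "part of GAP", flags=re.IGNORECASE)
print(m.groupdict())  # {'g0': None, 'g1': 'GAP'}
```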


Note that taking [0] from [words_dict.get(key) for key,value in match.groupdict().items() if value] is safe in all cases: the groups are alternatives, so whenever there is a match, exactly one group has matched.
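
To get the per-keyword counts the question asks for, the same idea can be combined with `collections.Counter`. This is a minimal sketch and a variation on the code above, not the code itself: it reads `match.lastgroup` (the name of the last group that matched) instead of scanning `groupdict()`, and it assumes the keywords are plain literals, hence the added `re.escape`. The names `names`, `alternatives` and `counts` are illustrative.

```python
import re
from collections import Counter

words = ["ERP", "Gap"]              # keywords, in the case we want reported
text = "ERP, erp, ErP, GAP, gap"    # sample string from the question

# Group name -> keyword as originally written
names = {f"g{i}": w for i, w in enumerate(words)}

# Same pattern shape as above; re.escape assumes literal keywords.
alternatives = "|".join(f"(?P<{name}>{re.escape(w)})" for name, w in names.items())
rx = re.compile(rf"\b(?:{alternatives})\b", flags=re.IGNORECASE)

# Exactly one named alternative matches, so lastgroup names it.
counts = Counter(names[m.lastgroup] for m in rx.finditer(text))

for keyword, n in counts.items():
    print(keyword, n)
```

If a dataframe is needed, the `counts` mapping can be handed to whatever aggregation step is already in use.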

Upvotes: 4

Sharad

Reputation: 10612

Refer to the comments above. Try:

>>> [x.upper() for x in r.findall(string)]
['ERP', 'GAP', 'ERP', 'ERP']
>>>

OR

>>> list(map(lambda x: x.upper(), r.findall(string)))
['ERP', 'GAP', 'ERP', 'ERP']
>>> 
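
Upper-casing gives 'GAP' rather than the 'Gap' spelling used in the keyword list. A possible variation, sketched here under the assumption that the keywords are plain literals and stay distinct when lower-cased, is to look each match up in a dictionary keyed by the lower-cased keyword. The names `canonical` and `normalized` are illustrative.

```python
import re

words = ["ERP", "Gap"]
string = 'ERP is integral part of GAP, so erp can never be ignored, ErP!'

# Lower-cased keyword -> keyword as originally written
canonical = {w.lower(): w for w in words}

# Case-insensitive alternation of the (escaped) keywords
r = re.compile("|".join(map(re.escape, words)), re.I)

# Normalize every match back to the spelling used in the keyword list
normalized = [canonical[m.lower()] for m in r.findall(string)]
print(normalized)  # ['ERP', 'Gap', 'ERP', 'ERP']
```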

Upvotes: -3
