Reputation: 586
When matching with a regex, we get back the text that matched. What if I want the part of the pattern (the keyword) that produced the match instead?
See the example below:
>>> import re
>>> r = re.compile('ERP|Gap', re.I)
>>> string = 'ERP is integral part of GAP, so erp can never be ignored, ErP!'
>>> r.findall(string)
['ERP', 'GAP', 'erp', 'ErP']
But I want the output to look like this: ['ERP', 'Gap', 'ERP', 'ERP']
The reason is that if I do a group-by and count on the original output, I get a dataframe like the following:
ERP 1
erp 1
ErP 1
GAP 1
gap 1
But what if I want the output to look like
ERP 3
Gap 2
in line with the keywords I am searching for?
MORE CONTEXT
I have a keyword list like this: ['ERP', 'Gap']. I have a string like this: "ERP, erp, ErP, GAP, gap"
I want to count the number of times each keyword appears in the string. With plain pattern matching, I get the following output: [ERP, erp, ErP, GAP, gap].
If I then aggregate and count, I get the following dataframe:
ERP 1
erp 1
ErP 1
GAP 1
gap 1
While I want the output to look like this:
ERP 3
Gap 2
Upvotes: 1
Views: 423
Reputation: 626926
You may build the pattern dynamically to include indices of the words you search for in the group names and then grab those pattern parts that matched:
import re
words = ["ERP", "Gap"]
words_dict = { f'g{i}':item for i,item in enumerate(words) }
rx = rf"\b(?:{'|'.join([ rf'(?P<g{i}>{item})' for i,item in enumerate(words) ])})\b"
text = 'ERP is integral part of GAP, so erp can never be ignored, ErP!'
results = []
for match in re.finditer(rx, text, flags=re.IGNORECASE):
    results.append( [words_dict.get(key) for key,value in match.groupdict().items() if value][0] )
print(results) # => ['ERP', 'Gap', 'ERP', 'ERP']
See the Python demo online
The pattern will look like \b(?:(?P<g0>ERP)|(?P<g1>Gap))\b:
\b - a word boundary
(?: - start of a non-capturing group encapsulating pattern parts:
(?P<g0>ERP) - Group "g0": ERP
| - or
(?P<g1>Gap) - Group "g1": Gap
) - end of the group
\b - a word boundary.
See the regex demo.
Note that the [0] in [words_dict.get(key) for key,value in match.groupdict().items() if value][0] will work in all cases because, whenever there is a match, exactly one named group has matched.
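To get the per-keyword counts the question asks for, the same named-group idea can be fed into a counter. The following is only a minimal sketch, not part of the original answer: it assumes the keywords are literal strings (hence the added re.escape), uses match.lastgroup instead of scanning groupdict(), and uses collections.Counter in place of a dataframe.

```python
import re
from collections import Counter

words = ["ERP", "Gap"]
# Map each generated group name (g0, g1, ...) back to its keyword.
words_dict = {f"g{i}": item for i, item in enumerate(words)}
# Assumption: keywords are literals, so escape any regex metacharacters.
rx = r"\b(?:" + "|".join(
    f"(?P<g{i}>{re.escape(item)})" for i, item in enumerate(words)
) + r")\b"

text = "ERP, erp, ErP, GAP, gap"
# Exactly one named group takes part in each match, so lastgroup names it.
counts = Counter(
    words_dict[m.lastgroup]
    for m in re.finditer(rx, text, flags=re.IGNORECASE)
)
print(counts)  # Counter({'ERP': 3, 'Gap': 2})
```

If a dataframe is needed, the counter's items can be passed on to whatever tabular structure is already in use.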
Upvotes: 4
Reputation: 10612
Refer to the comments above. Try:
>>> [x.upper() for x in r.findall(string)]
['ERP', 'GAP', 'ERP', 'ERP']
>>>
OR
>>> list(map(lambda x: x.upper(), r.findall(string)))
['ERP', 'GAP', 'ERP', 'ERP']
>>>
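Note that upper() yields 'GAP', not the keyword spelling 'Gap' the question asks for. One possible variant, sketched here under the assumption that the keywords are literal strings and are distinct when lowercased, is to look each match up in a lowercase-to-keyword dictionary. The names keywords, canonical and normalized are illustrative and do not come from the question.

```python
import re

keywords = ["ERP", "Gap"]
# Lowercased keyword -> spelling used in the keyword list.
canonical = {k.lower(): k for k in keywords}
# Escape the keywords so they are matched literally.
r = re.compile("|".join(map(re.escape, keywords)), re.I)

string = 'ERP is integral part of GAP, so erp can never be ignored, ErP!'
# Normalize every match to the spelling from the keyword list.
normalized = [canonical[m.lower()] for m in r.findall(string)]
print(normalized)  # ['ERP', 'Gap', 'ERP', 'ERP']
```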
Upvotes: -3