Sankar

Reputation: 586

Python regex ignoring pattern

I have a list of two keywords like below:

keywords = ["Azure", "Azure cloud"]

but Python is unable to find the second keyword, "Azure cloud":

>>> import re
>>> keywords = ["Azure", "Azure cloud"]
>>> r = re.compile('|'.join([re.escape(w) for w in keywords]), flags=re.I)
>>> word = "Azure and Azure cloud"
>>> r.findall(word)
['Azure', 'Azure']

I am expecting output like this: ['Azure', 'Azure', 'Azure cloud']

Any guide/help would be highly appreciated!

Upvotes: 1

Views: 106

Answers (2)

Boseong Choi

Reputation: 2596

You can run multiple searches, one per keyword, and chain the results together.

import itertools
import re

keywords = ["Azure", "Azure cloud"]
patterns = [re.compile(re.escape(w), flags=re.I) for w in keywords]
word = "Azure and Azure cloud"
results = list(itertools.chain.from_iterable(
    r.findall(word) for r in patterns
))
print(results)

output:

['Azure', 'Azure', 'Azure cloud']

Append

If I had word = "Azure and azure cloud", the output would be ['Azure', 'azure', 'azure cloud']. The second match, "azure", is in lowercase. If I need the exact word as it appears in the "keywords" list, which is "Azure", what modification has to be made to the code?

The flag re.I means ignore case, so simply remove it:

patterns = [re.compile(re.escape(w)) for w in keywords]

Append 2

Sorry, my last comment was vague. I want the pattern matching to ignore case, but in the output I want each keyword to have the exact case it has in the "keywords" list, not the case it has in "word".

Sorry for the misunderstanding. Try this:

import re

keywords = ["Azure", "azure cloud"]
patterns = [re.compile(w, flags=re.I) for w in keywords]
word = "Azure and azure cloud"
results = [
    match_obj.re.pattern
    for r in patterns
    for match_obj in r.finditer(word)
]
print(results)

output:

['Azure', 'Azure', 'azure cloud']

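I'm not sure this is the most efficient way, but it works.
Note that I removed re.escape because it escapes the space, and match_obj.re.pattern returns the escaped pattern text, so the result was as shown below. Be aware that without re.escape, any keyword containing regex metacharacters (such as "." or "+") is interpreted as a pattern rather than as literal text.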

['Azure', 'Azure', 'azure\\ cloud']
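One possible way to keep re.escape and still report the keyword with its original case is to pair each keyword with its compiled pattern and emit the keyword itself for every match, instead of reading it back from match_obj.re.pattern. This is only a sketch; the name compiled is introduced here for illustration and is not part of the code above.

```python
import re

keywords = ["Azure", "azure cloud"]
word = "Azure and azure cloud"

# Pair each keyword with its own escaped, case-insensitive pattern
# ("compiled" is a hypothetical helper name used only in this sketch).
compiled = [(kw, re.compile(re.escape(kw), flags=re.I)) for kw in keywords]

# Emit the keyword as written in the list, once per match found in the text.
results = [kw for kw, pattern in compiled for _ in pattern.finditer(word)]
print(results)  # ['Azure', 'Azure', 'azure cloud']
```

Because the keyword is taken from the list rather than from the pattern text, escaping no longer leaks into the output, and keywords containing regex metacharacters are still matched literally.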

Upvotes: 1

Masklinn

Reputation: 42492

findall finds all non-overlapping matches, and for an alternation it tries the alternatives left to right at each position.

So what happens here is that the regex engine reaches "Azure cloud", matches "Azure" with the first alternative, and then resumes searching from the end of that match, i.e. in " cloud". The second alternative is never tried at that position because the first one already succeeded.

If you expect "Azure and Azure cloud" to yield "Azure", "Azure" and "Azure cloud", you need to run each pattern individually, not a single alternating pattern.
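The left-to-right behaviour can be seen by swapping the order of the alternatives. The following is a small illustrative sketch; the variable names text, short_first and long_first are made up for this example.

```python
import re

text = "Azure and Azure cloud"

# Shorter alternative first: "Azure" succeeds at every position where
# "Azure cloud" could also match, so the longer keyword is never reported.
short_first = re.compile("Azure|Azure cloud", flags=re.I)
print(short_first.findall(text))  # ['Azure', 'Azure']

# Longer alternative first: "Azure cloud" wins where it fits, but the
# "Azure" inside it is then consumed and not reported separately.
long_first = re.compile("Azure cloud|Azure", flags=re.I)
print(long_first.findall(text))  # ['Azure', 'Azure cloud']
```

Neither ordering produces all three results, since matches from a single pattern never overlap, which is why separate searches per keyword are needed.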

Upvotes: 2
