Reputation: 586
I have a list of two keywords like below:
keywords = ["Azure", "Azure cloud"]
but Python is unable to find the second keyword, "Azure cloud":
>>> import re
>>> keywords = ["Azure", "Azure cloud"]
>>> r = re.compile('|'.join([re.escape(w) for w in keywords]), flags=re.I)
>>> word = "Azure and Azure cloud"
>>> r.findall(word)
['Azure', 'Azure']
I am expecting output like this: ['Azure', 'Azure', 'Azure cloud']
Any guide/help would be highly appreciated!
Upvotes: 1
Views: 106
Reputation: 2596
You can run multiple searches, one per keyword.
import itertools
import re
keywords = ["Azure", "Azure cloud"]
patterns = [re.compile(re.escape(w), flags=re.I) for w in keywords]
word = "Azure and Azure cloud"
results = list(itertools.chain.from_iterable(
    r.findall(word) for r in patterns
))
print(results)
output:
['Azure', 'Azure', 'Azure cloud']
If I had word = "Azure and azure cloud", the output would be ['Azure', 'azure', 'azure cloud']. The second match, "azure", is lowercase. If I need the output to use the exact spelling from the keywords list, which is "Azure", what modification has to be made in the code?
The flag re.I means ignore-case, so simply remove it.
patterns = [re.compile(re.escape(w)) for w in keywords]
Sorry, my last comment was vague. I want the pattern matching to ignore case, but in the output I want each keyword in the exact case it has in the keywords list, not the case it has in word.
Sorry for the misunderstanding. Try this:
import re
keywords = ["Azure", "azure cloud"]
patterns = [re.compile(w, flags=re.I) for w in keywords]
word = "Azure and azure cloud"
results = [
    match_obj.re.pattern
    for r in patterns
    for match_obj in r.finditer(word)
]
print(results)
output:
['Azure', 'Azure', 'azure cloud']
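I'm not sure it is an efficient way, but it works.
Note that I removed re.escape because it escapes the space. This means keywords containing regex metacharacters such as "+" or "." would be interpreted as patterns rather than literal text. With re.escape, the result was: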
['Azure', 'Azure', 'azure\\ cloud']
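If the keywords may contain regex metacharacters, one possible variation is to keep re.escape and pair each compiled pattern with its original keyword, so the keyword itself is returned instead of the pattern text. This is only a sketch: the find_keywords helper and the extra "C++" keyword are made up for illustration and are not part of the original code.

```python
import re

keywords = ["Azure", "azure cloud", "C++"]  # "C++" added only to show escaping

# Pair every keyword with its own escaped, case-insensitive pattern.
patterns = [(kw, re.compile(re.escape(kw), flags=re.I)) for kw in keywords]


def find_keywords(text):
    """Return the keyword, spelled as in `keywords`, once per match in text."""
    return [kw for kw, pattern in patterns for _ in pattern.finditer(text)]


print(find_keywords("Azure and azure cloud with c++"))
# ['Azure', 'Azure', 'azure cloud', 'C++']
```

Because the output comes from the keywords list rather than from match_obj.re.pattern, the escaped space never shows up in the result.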
Upvotes: 1
Reputation: 42492
findall finds all non-overlapping matches, and in the case of alternations it tries the alternatives left to right.
So what happens here is that the regex engine reaches "Azure cloud", manages to match "Azure", and then resumes searching in the remaining " cloud", since that stretch of text has already been consumed by the "Azure" match.
If you expect "Azure and Azure cloud" to yield "Azure", "Azure" and "Azure cloud", you need to run each pattern individually, not a single alternating pattern.
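A small sketch of the left-to-right behaviour described above. The names short_first and long_first are just illustrative. Reordering the alternatives changes which one wins at a given position, but a single alternation still returns at most one match per stretch of text:

```python
import re

text = "Azure and Azure cloud"

# Shorter alternative first: "Azure" always wins, "Azure cloud" is never tried.
short_first = re.compile("Azure|Azure cloud", flags=re.I)
# Longer alternative first: "Azure cloud" wins where it fits.
long_first = re.compile("Azure cloud|Azure", flags=re.I)

print(short_first.findall(text))  # ['Azure', 'Azure']
print(long_first.findall(text))   # ['Azure', 'Azure cloud']
```

Neither ordering produces all three results, which is why separate patterns are needed for overlapping keywords.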
Upvotes: 2