Reputation: 1598
I'm looking for an effective way to solve this problem.
Let's say we want to find a list of words in a string, ignoring case, but instead of storing the matched text we want each word with the same case as in the original list.
For example:
words_to_match = ['heLLo', 'jumP', 'TEST', 'RESEARCH stuff']
text = 'hello this is jUmp test jump and research stuff'
# Result should be {'TEST', 'heLLo', 'jumP', 'RESEARCH stuff'}
Here is my current approach:
words_to_match = ['heLLo', 'jumP', 'TEST', 'RESEARCH stuff']
I convert this to the following regex:
import re
regex = re.compile(r'\bheLLo\b|\bjumP\b|\bTEST\b|\bRESEARCH stuff\b', re.IGNORECASE)
Then:
word_founds = re.findall(regex,'hello this is jUmp test jump and research stuff')
normalization_dict = {w.lower():w for w in words_to_match}
# normalization dict : {'hello': 'heLLo', 'jump': 'jumP', 'test': 'TEST', 'research stuff': 'RESEARCH stuff'}
final_list = [normalization_dict[w.lower()] for w in word_founds]
# final_list : ['heLLo', 'jumP', 'TEST', 'jumP', 'RESEARCH stuff']
final_result = set(final_list)
# final_result : {'TEST', 'heLLo', 'jumP', 'RESEARCH stuff'}
This gives my expected result. I just want to know whether there is a faster or more elegant way to solve this problem.
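For reference, the regex does not have to be written out by hand; it can be built from words_to_match. The following is only a sketch of the approach described above, not a benchmarked improvement. The function name find_listed_words is hypothetical, and it assumes ASCII words that start and end with a word character, so that \b and lower() behave as expected.

```python
import re

def find_listed_words(words_to_match, text):
    """Return the listed words found in text, spelled as in the list."""
    if not words_to_match:
        return set()  # an empty alternation would match the empty string
    # Map each lowercased word back to its spelling in the list.
    normalization_dict = {w.lower(): w for w in words_to_match}
    # Build the alternation from the list; re.escape guards against
    # regex metacharacters inside a word.
    pattern = "|".join(r"\b" + re.escape(w) + r"\b" for w in words_to_match)
    regex = re.compile(pattern, re.IGNORECASE)
    # The pattern has no groups, so findall returns the full matches.
    return {normalization_dict[m.lower()] for m in regex.findall(text)}

words_to_match = ['heLLo', 'jumP', 'TEST', 'RESEARCH stuff']
text = 'hello this is jUmp test jump and research stuff'
result = find_listed_words(words_to_match, text)
```

Using a set comprehension at the end removes the intermediate final_list step.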
Upvotes: 0
Views: 156
Reputation: 56
This can be done in a single line, if you're still okay with using regex.
results = set(word for word in re.findall(r"[\w']+", text) if word.lower() in [w.lower() for w in words_to_match])
The regex is used here only to split the text variable on word boundaries.
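To see what this one-liner returns, here is a small sketch using the data from the question. The only change is the assumed helper variable lowered, which lowercases the list once instead of rebuilding it for every token.

```python
import re

words_to_match = ['heLLo', 'jumP', 'TEST', 'RESEARCH stuff']
text = 'hello this is jUmp test jump and research stuff'

# Lowercase the list once instead of rebuilding it for every token.
lowered = {w.lower() for w in words_to_match}

# Tokens come from the text, so they keep the text's casing.
results = {word for word in re.findall(r"[\w']+", text) if word.lower() in lowered}
print(results)  # contains 'hello', 'jUmp', 'test', 'jump' (set order may vary)
```

Because the tokens are taken from the text, the result uses the text's casing rather than the list's, and the multi-word entry 'RESEARCH stuff' is never matched since the text is split into single words.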
Edit:
You could also use:
import string
results = set(word for word in "".join(c if c not in string.punctuation else " " for c in text).split()
if word.lower() in [w.lower() for w in words_to_match])
if you want to avoid importing re, but then you have to import string instead.
Edit 2: (after properly reading the question, hopefully)
results = set(word for word in words_to_match if word.lower() in text.lower())
This works with multi-word searches as well, although a plain substring test also matches inside longer words.
Edit 3:
results = set(word for word in words_to_match if re.search(r"\b" + word.lower() + r"\b", text.lower()))
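A slightly more defensive variant of the word-boundary check, as a sketch only. The function name match_whole_words is hypothetical. It wraps each word in re.escape so that regex metacharacters in a word are taken literally, and it assumes each word starts and ends with a word character, otherwise \b will not behave as intended.

```python
import re

def match_whole_words(words_to_match, text):
    """Return the listed words that occur in text as whole words, ignoring case."""
    lowered_text = text.lower()  # lowercase the text once
    return {
        word
        for word in words_to_match
        # re.escape keeps characters such as '.' or '(' from acting as regex syntax.
        if re.search(r"\b" + re.escape(word.lower()) + r"\b", lowered_text)
    }

words_to_match = ['heLLo', 'jumP', 'TEST', 'RESEARCH stuff']
text = 'hello this is jUmp test jump and research stuff'
results = match_whole_words(words_to_match, text)
```

Since the iteration is over words_to_match, the result keeps the list's casing, and 'testing' in the text would not count as a match for 'TEST'.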
Upvotes: 2
Reputation: 103
Try this:
words_to_match = ['heLLo', 'jumP', 'TEST']
text = 'hello this is jUmp test jump'
result = set()
# Use a name other than str so the built-in type is not shadowed
for word in words_to_match:
    if word.lower() in text.lower():
        result.add(word)
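A short sketch of how this loop behaves on a different input. The text below is made up for illustration, and lowered_text is an assumed helper variable that avoids lowercasing the text on every iteration.

```python
words_to_match = ['heLLo', 'jumP', 'TEST']
text = 'testing a jumpy greeting'  # made-up input for illustration

lowered_text = text.lower()  # lowercase the text once, outside the loop
result = set()
for word in words_to_match:
    # Substring test: 'test' is found inside 'testing', 'jump' inside 'jumpy'.
    if word.lower() in lowered_text:
        result.add(word)
print(result)  # contains 'jumP' and 'TEST' (set order may vary)
```

Whether matching inside longer words is acceptable depends on the use case; if only whole words should count, a word-boundary regex is needed instead of the in operator.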
Upvotes: 0