Reputation: 79
I would like to do some word filtering (extracting only items in 'keyword' list that exist in 'whitelist').
Here is my code so far:
whitelist = ['Cat', 'Dog', 'Cow']
keyword = ['Cat, Cow, Horse', 'Bird, Whale, Dog', 'Pig, Chicken', 'Tiger, Cat']
keyword_filter = []
for word in whitelist:
    for i in range(len(keyword)):
        if word in keyword[i]:
            keyword_filter.append(word)
        else: pass
I want to remove every word except for 'Cat', 'Dog', and 'Cow' (which are in the 'whitelist') so that the result ('keyword_filter' list) will look like this:
['Cat, Cow', 'Dog', '', 'Cat']
However, the result I actually get is:
['Cat', 'Cat', 'Dog', 'Cow']
I would sincerely appreciate any advice.
Upvotes: 2
Views: 1475
Reputation: 78564
You need to split each string in the list and check whether each word in the split is contained in the whitelist. Then rejoin the words that passed the filter:
whitelist = {'Cat', 'Dog', 'Cow'}
filtered = []
for words in keyword:
    filtered.append(', '.join(w for w in words.split(', ') if w in whitelist))
print(filtered)
# ['Cat, Cow', 'Dog', '', 'Cat']
It is better to make whitelist a set, which speeds up the lookup of each word.
You could also use re.findall to find every part of each string that matches a whitelisted word, and then rejoin the matches:
import re
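pattern = re.compile(r',?\s?Cat|,?\s?Dog|,?\s?Cow')
# lstrip drops the leading ', ' left behind when the first word of a string is filtered out
filtered = [''.join(pattern.findall(words)).lstrip(', ') for words in keyword]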
Upvotes: 3
Reputation: 373
Since you want to preserve the order of your keyword list, you'll want to have that as the outermost loop.
for phrase in keyword:
Now you need to split up the phrase into its actual words and determine if those words are in the whitelist. Then you need to put the words back together. You can do this in one line.
    filtered = ", ".join(word for word in phrase.split(", ") if word in whitelist)
Breakdown: phrase.split(", ") gives you a list of the strings that were separated by ", " in the original string, i.e. the words you care about. word for word in ... if word in whitelist is a generator expression (a list comprehension without the brackets). It yields each word in ..., in this case the result of phrase.split(", "), that meets the condition word in whitelist. Finally, ", ".join(...) gives you a string made up of every element of ... connected by ", ".
Lastly, you need to put the newly filtered string into your list of filtered strings.
    keyword_filter.append(filtered)
As a side note, I agree with others that you should use a set for your collection of whitelisted words. It has a much faster lookup time. However, for a minuscule list of words like the one in this example, you won't notice a performance difference.
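Putting the steps above together, a minimal runnable sketch could look like this. It assumes the whitelist and keyword data from the question, with the whitelist written as a set as suggested.

```python
whitelist = {'Cat', 'Dog', 'Cow'}  # a set, for fast membership tests
keyword = ['Cat, Cow, Horse', 'Bird, Whale, Dog', 'Pig, Chicken', 'Tiger, Cat']

keyword_filter = []
for phrase in keyword:
    # Keep only the whitelisted words, preserving their order within the phrase.
    filtered = ", ".join(word for word in phrase.split(", ") if word in whitelist)
    keyword_filter.append(filtered)

print(keyword_filter)
# ['Cat, Cow', 'Dog', '', 'Cat']
```

A phrase with no whitelisted words becomes an empty string, so the result keeps the same length and order as the input list.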
Upvotes: 1
Reputation: 5078
You could use regex:
import re
whitelist = ['Cat', 'Dog', 'Cow']
keyword = ['Cat, Cow, Horse', 'Bird, Whale, Dog', 'Pig, Chicken', 'Tiger, Cat']
keyword_filter = []
for words in keyword:
    # Raw strings keep \s from being treated as a string escape sequence.
    match = re.findall(r'(' + '|'.join(whitelist) + r')[,\s]*', words)
    keyword_filter.append(', '.join(match))
print(keyword_filter)
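One caveat with a plain alternation is that it also matches inside longer words, so 'Cat' would be found in 'Caterpillar'. A possible variant, sketched here under the assumption that only whole words should match, adds \b word boundaries and passes each whitelist entry through re.escape. The 'Caterpillar, Dog' entry is a hypothetical addition to the question's data, included only to show the difference.

```python
import re

whitelist = ['Cat', 'Dog', 'Cow']
# The last entry is hypothetical, added to demonstrate whole-word matching.
keyword = ['Cat, Cow, Horse', 'Bird, Whale, Dog', 'Pig, Chicken',
           'Tiger, Cat', 'Caterpillar, Dog']

# re.escape guards against regex metacharacters in whitelist entries;
# \b anchors each alternative to word boundaries.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, whitelist)) + r')\b')

keyword_filter = [', '.join(pattern.findall(words)) for words in keyword]
print(keyword_filter)
# ['Cat, Cow', 'Dog', '', 'Cat', 'Dog']
```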
Upvotes: 0
Reputation: 92884
Simple list comprehension:
whitelist = ['Cat', 'Dog', 'Cow']
keyword = ['Cat, Cow, Horse', 'Bird, Whale, Dog', 'Pig, Chicken', 'Tiger, Cat']
keyword_filter = [', '.join(w for w in k.split(', ') if w in whitelist) for k in keyword]
print(keyword_filter)
The output:
['Cat, Cow', 'Dog', '', 'Cat']
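The comprehension relies on every separator being exactly ', '. If the real data has inconsistent spacing, one option is to split on ',' alone and strip each piece. This is only a sketch: the messy list below is a hypothetical input, not taken from the question.

```python
whitelist = {'Cat', 'Dog', 'Cow'}
# Hypothetical input with inconsistent spacing around the commas.
messy = ['Cat,Cow ,  Horse', ' Bird, Whale,Dog', 'Pig , Chicken']

cleaned = [
    # Split on bare commas, strip whitespace from each piece, then filter.
    ', '.join(w for w in (part.strip() for part in k.split(',')) if w in whitelist)
    for k in messy
]
print(cleaned)
# ['Cat, Cow', 'Dog', '']
```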
Upvotes: 1
Reputation: 180
Try this:
whitelist = ['Cat', 'Dog', 'Cow']
keyword = ['Cat, Cow, Horse', 'Bird, Whale, Dog', 'Pig, Chicken', 'Tiger, Cat']
keyword_filter = []
for word in keyword:
    whitelistedWords = []
    for w in word.split(', '):
        if w in whitelist:
            whitelistedWords.append(w)
    # print(whitelistedWords)
    keyword_filter.append(', '.join(whitelistedWords))
print(keyword_filter)
Upvotes: 1