Jay

Reputation: 79

Python comparing two lists and filtering items

I would like to filter words, keeping only the items in the 'keyword' list that also appear in 'whitelist'.

Here is my code so far:

whitelist = ['Cat', 'Dog', 'Cow']
keyword = ['Cat, Cow, Horse', 'Bird, Whale, Dog', 'Pig, Chicken', 'Tiger, Cat']
keyword_filter = []
 
for word in whitelist:
    for i in range(len(keyword)):
        if word in keyword[i]:
            keyword_filter.append(word)
        else: pass

I want to remove every word except for 'Cat', 'Dog', and 'Cow' (which are in the 'whitelist') so that the result ('keyword_filter' list) will look like this:

['Cat, Cow', 'Dog', '', 'Cat']

However, the result I actually get looks like this:

['Cat', 'Cat', 'Dog', 'Cow']

I would sincerely appreciate any advice.

Upvotes: 2

Views: 1475

Answers (5)

Moses Koledoye

Reputation: 78564

You need to split each string in the list and check whether each word from the split is in the whitelist. Then rejoin the words that pass the filter:

whitelist = {'Cat', 'Dog', 'Cow'}
filtered = []
for words in keyword:
    filtered.append(', '.join(w for w in words.split(', ') if w in whitelist))

print(filtered)
# ['Cat, Cow', 'Dog', '', 'Cat']

Better to make whitelist a set to improve the performance for lookup of each word.
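As a rough illustration of that point, here is a minimal sketch that runs the same filter against a list and a set. The names filter_keywords, whitelist_list and whitelist_set are made up for this example. Both containers give the same result; the set only changes how each membership test is performed.

```python
keyword = ['Cat, Cow, Horse', 'Bird, Whale, Dog', 'Pig, Chicken', 'Tiger, Cat']

# Hypothetical names for illustration: the same whitelist as a list and as a set.
whitelist_list = ['Cat', 'Dog', 'Cow']
whitelist_set = set(whitelist_list)

def filter_keywords(phrases, allowed):
    """Keep only the comma-separated words of each phrase that are in `allowed`."""
    return [', '.join(w for w in p.split(', ') if w in allowed) for p in phrases]

# Membership tests on a set are O(1) on average; on a list they scan
# the elements one by one. The filtered output is identical either way.
result_from_list = filter_keywords(keyword, whitelist_list)
result_from_set = filter_keywords(keyword, whitelist_set)

print(result_from_set)
# ['Cat, Cow', 'Dog', '', 'Cat']
```

With only three whitelisted words the difference is not measurable; it starts to matter when the whitelist grows to thousands of entries.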

You could also use re.findall to find the parts of each string that match words in the whitelist, and then rejoin the matches:

import re

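# Raw string so that \s is passed to the regex engine unchanged
pattern = re.compile(r',?\s?Cat|,?\s?Dog|,?\s?Cow')
# Strip the leading ', ' left behind when the first match is not the first word
filtered = [''.join(pattern.findall(words)).lstrip(', ') for words in keyword]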

Upvotes: 3

colopop

Reputation: 373

Since you want to preserve the order of your keyword list, you'll want to have that as the outermost loop.

for phrase in keyword:

Now you need to split up the phrase into its actual words and determine if those words are in the whitelist. Then you need to put the words back together. You can do this in one line.

   filtered = ", ".join(word for word in phrase.split(", ") if word in whitelist)

Breakdown: phrase.split(", ") gives you a list of the strings that were separated by ", " in the original string, i.e. the words you care about. word for word in ... if word in whitelist is a generator expression. It yields each word in ..., in this case phrase.split(", "), that meets the condition word in whitelist. Finally, ", ".join(...) gives you a string made up of those words connected by ", ".

Lastly, you need to put the newly filtered string into your list of filtered strings.

   keyword_filter.append(filtered)
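Putting the pieces above together, a minimal runnable sketch could look like this. It reuses the question's data and variable names, and assumes the whitelist has been turned into a set and that the filter is written as a generator expression (word for word in ...).

```python
whitelist = {'Cat', 'Dog', 'Cow'}
keyword = ['Cat, Cow, Horse', 'Bird, Whale, Dog', 'Pig, Chicken', 'Tiger, Cat']
keyword_filter = []

# Outer loop over the phrases, so the output keeps the order of `keyword`.
for phrase in keyword:
    # Split into words, keep the whitelisted ones, and join them back together.
    filtered = ", ".join(word for word in phrase.split(", ") if word in whitelist)
    keyword_filter.append(filtered)

print(keyword_filter)
# ['Cat, Cow', 'Dog', '', 'Cat']
```

A phrase with no whitelisted words ends up as an empty string, which matches the output the question asks for.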

As a side note, I agree with others that you should use a set for your collection of whitelisted words. It has much faster lookup time. However, for a minuscule list of words like the one in this example, you won't notice a performance difference.

Upvotes: 1

mrCarnivore

Reputation: 5078

You could use regex:

import re

whitelist = ['Cat', 'Dog', 'Cow']
keyword = ['Cat, Cow, Horse', 'Bird, Whale, Dog', 'Pig, Chicken', 'Tiger, Cat']
keyword_filter = []

for words in keyword:
    # Raw strings keep \s intact; the capture group returns only the whitelisted word
    match = re.findall(r'(' + '|'.join(whitelist) + r')[,\s]*', words)
    keyword_filter.append(', '.join(match))
print(keyword_filter)
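One caveat with a pattern built this way is that it also matches a whitelisted word inside a longer word, such as 'Cat' in 'Caterpillar'. A possible variant, offered as a sketch rather than as part of the answer above, escapes each word with re.escape and anchors it with \b word boundaries. The 'Caterpillar, Dog' entry is a made-up input added only to show the difference.

```python
import re

whitelist = ['Cat', 'Dog', 'Cow']
# The last entry is a hypothetical input used to demonstrate word boundaries.
keyword = ['Cat, Cow, Horse', 'Bird, Whale, Dog', 'Pig, Chicken',
           'Tiger, Cat', 'Caterpillar, Dog']

# re.escape guards against regex metacharacters in the whitelist;
# \b ... \b only matches whole words, and (?:...) avoids a capture group
# so findall returns the full matches.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, whitelist)) + r')\b')

keyword_filter = [', '.join(pattern.findall(words)) for words in keyword]

print(keyword_filter)
# ['Cat, Cow', 'Dog', '', 'Cat', 'Dog']
```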

Upvotes: 0

RomanPerekhrest

Reputation: 92884

Simple list comprehension:

whitelist = ['Cat', 'Dog', 'Cow']
keyword = ['Cat, Cow, Horse', 'Bird, Whale, Dog', 'Pig, Chicken', 'Tiger, Cat']
keyword_filter = [', '.join(w for w in k.split(', ') if w in whitelist) for k in keyword]

print(keyword_filter)

The output:

['Cat, Cow', 'Dog', '', 'Cat']

Upvotes: 1

segFaulter

Reputation: 180

Try this:

whitelist = ['Cat', 'Dog', 'Cow']
keyword = ['Cat, Cow, Horse', 'Bird, Whale, Dog', 'Pig, Chicken', 'Tiger, Cat']
keyword_filter = []

for word in keyword:
    whitelistedWords = []
    for w in word.split(', '):
        if w in whitelist:
            whitelistedWords.append(w)
            #print whitelistedWords
    keyword_filter.append( ', '.join(whitelistedWords) )

print(keyword_filter)

Upvotes: 1
