JohnJ

Reputation: 7056

check if element of list is present in elements of another list

I am having some trouble with lists. So, basically, I have a list:

a=["Britney spears", "red dog", "\xa2xe3"]

and I have another list, looking like:

b = ["cat","dog","red dog is stupid", "good stuff \xa2xe3", "awesome Britney spears"]

What I would like to do is check whether any element of a occurs inside an element of b and, if so, remove it from that element of b. So I would like b to end up as:

b = ["cat","dog","is stupid","good stuff","awesome"]

What is the most Pythonic way (in 2.7.x) to achieve this?

I assume I could loop over both lists and check each element, but I am not sure that is very efficient: b has around 50k elements.

Upvotes: 2

Views: 869

Answers (3)

sloth

Reputation: 101142

Well, I don't know if this counts as Pythonic anymore, since reduce got exiled to functools in Python 3, but someone has to put a one-liner on the table:

a = ["Britney spears", "red dog", "\xa2xe3"]
b = ["cat","dog","red dog is stupid", "good stuff \xa2xe3", "awesome Britney spears"]

b = [reduce(lambda acc, n: acc.replace(n, ''), a, x).strip() for x in b]

Even faster would be:

[reduce(lambda acc, n: acc.replace(n, '') if n in acc else acc, a, x).strip() for x in b]

But as readability decreases, it gets less Pythonic, I think.
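
For anyone running this on Python 3, here is a minimal sketch of the same one-liner with reduce imported from functools (the import also works on 2.6+, as far as I know). The helper name strip_phrases is made up for illustration:

```python
from functools import reduce  # built-in on Python 2, functools-only on Python 3

a = ["Britney spears", "red dog", "\xa2xe3"]
b = ["cat", "dog", "red dog is stupid", "good stuff \xa2xe3", "awesome Britney spears"]


def strip_phrases(text, phrases):
    # Hypothetical helper: remove each phrase from text, one after another,
    # then trim the whitespace left at either end.
    return reduce(lambda acc, phrase: acc.replace(phrase, ''), phrases, text).strip()


result = [strip_phrases(x, a) for x in b]
print(result)  # ['cat', 'dog', 'is stupid', 'good stuff', 'awesome']
```

Like the original one-liner, this replaces plain substrings, so it would also cut "red dog" out of "transferred dogcatcher".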

Here's one that handles the "transferred dogcatcher" case, where "red dog" only appears inside other words and should be left alone. I borrowed mgilson's regex, but I think it's OK because it's quite trivial :-):

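import re

def reducer(acc, n):
    # Only run the regex when the phrase occurs somewhere in the string.
    if n in acc:
        # Remove the phrase only when whitespace or a string end is on both sides.
        return re.sub(r'(?:\s+|^)' + re.escape(n) + r'(?:\s+|$)', '', acc)
    return acc

b = [reduce(reducer, a, x).strip() for x in b]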

I extracted the lambda to a named function for readability.

Upvotes: 2

mgilson

Reputation: 310227

I think I'd use a regex here:

import re

a=["Britney spears", "red dog", "\xa2xe3"]

regex = re.compile('|'.join(re.escape(x) for x in a))

b=["cat","dog","red dog is stupid", "good stuff \xa2xe3", "awesome Britney spears"]

b = [regex.sub("",x) for x in b ]
print (b)  #['cat', 'dog', ' is stupid', 'good stuff ', 'awesome ']

This way, the regular expression engine can optimize the test for the list of alternatives.
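
Whether the single compiled pattern actually beats repeated str.replace calls depends on the data and on the size of a, so it is worth measuring. A rough sketch of how one might check, assuming a synthetic 50,000-element list stands in for the real b (the function names are made up for illustration):

```python
import re
import timeit

a = ["Britney spears", "red dog", "\xa2xe3"]
# Assumption: synthetic data, the question's five strings repeated 10,000 times.
b = ["cat", "dog", "red dog is stupid", "good stuff \xa2xe3", "awesome Britney spears"] * 10000

regex = re.compile('|'.join(re.escape(x) for x in a))


def with_regex():
    # One pass per string over a single alternation pattern.
    return [regex.sub("", x) for x in b]


def with_replace():
    # One str.replace call per phrase per string.
    out = []
    for x in b:
        for phrase in a:
            x = x.replace(phrase, "")
        out.append(x)
    return out


# Both approaches should agree on this data before timing them.
assert with_regex() == with_replace()

print("regex:  ", timeit.timeit(with_regex, number=5))
print("replace:", timeit.timeit(with_replace, number=5))
```

No timings are quoted here because they vary by machine and Python version.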

Here are a few variations to show how different regexes behave.

import re

a = ["Britney spears", "red dog", "\xa2xe3"]
b = ["cat","dog",
     "red dog is stupid", 
     "good stuff \xa2xe3", 
     "awesome Britney spears",
     "transferred dogcatcher"]

#This version leaves whitespace and will match between words.
regex = re.compile('|'.join(re.escape(x) for x in a))
c = [regex.sub("",x) for x in b ]
print (c) #['cat', 'dog', ' is stupid', 'good stuff ', 'awesome ', 'transfercatcher']

#This version strips whitespace from either end
# of the returned string
regex = re.compile('|'.join(r'\s*{}\s*'.format(re.escape(x)) for x in a))
c = [regex.sub("",x) for x in b ]
print (c) #['cat', 'dog', 'is stupid', 'good stuff', 'awesome', 'transfercatcher']

#This version will only match at word boundaries,
# but you lose the match with \xa2xe3 since it isn't a word
regex = re.compile('|'.join(r'\s*\b{}\b\s*'.format(re.escape(x)) for x in a))
c = [regex.sub("",x) for x in b ]
print (c) #['cat', 'dog', 'is stupid', 'good stuff \xa2xe3', 'awesome', 'transferred dogcatcher']


#This version finally seems to get it right.  It matches whitespace (or the start
# of the string) and then the "word" and then more whitespace (or the end of the 
# string).  It then replaces that match with nothing -- i.e. it removes the match 
# from the string.
regex = re.compile('|'.join(r'(?:\s+|^)'+re.escape(x)+r'(?:\s+|$)' for x in a))
c = [regex.sub("",x) for x in b ]
print (c) #['cat', 'dog', 'is stupid', 'good stuff', 'awesome', 'transferred dogcatcher']
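
One caveat with the last version: because it consumes the whitespace on both sides, a phrase in the middle of a string leaves its neighbours glued together. A small sketch of one possible workaround, replacing with a single space and stripping afterwards (the example strings and the names glued and spaced are made up for illustration):

```python
import re

a = ["Britney spears", "red dog", "\xa2xe3"]
b = ["my red dog is stupid", "red dog is stupid", "transferred dogcatcher"]

regex = re.compile('|'.join(r'(?:\s+|^)' + re.escape(x) + r'(?:\s+|$)' for x in a))

# Replacing with "" joins the neighbouring words when the match is mid-string.
glued = [regex.sub("", x) for x in b]
print(glued)   # ['myis stupid', 'is stupid', 'transferred dogcatcher']

# Replacing with one space and stripping keeps the neighbours apart.
spaced = [regex.sub(" ", x).strip() for x in b]
print(spaced)  # ['my is stupid', 'is stupid', 'transferred dogcatcher']
```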

Upvotes: 4

Silas Ray

Reputation: 26160

Well, the simplest would be a straight list comprehension, and as long as a is small, it can even be a pretty efficient method.

b = [i for i in b if i not in a]
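
Note that this comprehension drops elements of b that are exactly equal to an element of a; it does not strip substrings. A quick sketch with the question's data to show the difference, plus a set-based variant for the case where a is not small (the names filtered and a_set are made up for illustration):

```python
a = ["Britney spears", "red dog", "\xa2xe3"]
b = ["cat", "dog", "red dog is stupid", "good stuff \xa2xe3", "awesome Britney spears"]

# No element of b is exactly equal to an element of a, so nothing is removed.
filtered = [i for i in b if i not in a]
print(filtered == b)  # True

# An exact match, on the other hand, is dropped as a whole element.
exact = [i for i in ["cat", "red dog"] if i not in a]
print(exact)  # ['cat']

# If a grows large, a set gives constant-time membership tests on average.
a_set = set(a)
print([i for i in b if i not in a_set] == filtered)  # True
```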

Upvotes: 1
