Reputation: 7056
I am having some trouble with lists. So, basically, I have a list:
a=["Britney spears", "red dog", "\xa2xe3"]
and I have another list, looking like:
b = ["cat","dog","red dog is stupid", "good stuff \xa2xe3", "awesome Britney spears"]
What I would like to do is check whether the elements of a appear as substrings of the elements of b, and if so, remove them from those elements of b. So, I would like b to look like:
b = ["cat","dog","is stupid","good stuff","awesome"]
What is the most pythonic (in 2.7.x) way to achieve this?
I am assuming I can loop over the elements and check each one, but I am not sure this is very efficient, since b has around 50k elements.
Upvotes: 2
Views: 869
Reputation: 101142
Well, I don't know if this counts as pythonic anymore, since reduce got exiled to functools in Python 3, but someone has to put a one-liner on the table:
a = ["Britney spears", "red dog", "\xa2xe3"]
b = ["cat","dog","red dog is stupid", "good stuff \xa2xe3", "awesome Britney spears"]
b = [reduce(lambda acc, n: acc.replace(n, ''), a, x).strip() for x in b]
Even faster would be:
[reduce(lambda acc, n: acc.replace(n, '') if n in acc else acc, a, x).strip() for x in b]
but as readability decreases, it's getting less pythonic I think.
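If the reduce reads as too dense, here is the same logic unrolled into plain loops (just a sketch, assuming the original a and b; it should behave the same as the one-liner above):
cleaned = []
for x in b:
    acc = x
    for n in a:
        if n in acc:                 # skip the replace when the needle is absent
            acc = acc.replace(n, '')
    cleaned.append(acc.strip())
b = cleaned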
Here's one that handles the "transferred dogcatcher" case. I borrowed mgilson's regex, but I think it's OK because it's quite trivial :-):
import re

def reducer(acc, n):
    # Only run the regex when the needle actually occurs in the string;
    # the pattern also eats the surrounding whitespace so no gap is left.
    if n in acc:
        return re.sub(r'(?:\s+|^)' + re.escape(n) + r'(?:\s+|$)', '', acc)
    return acc

b = [reduce(reducer, a, x).strip() for x in b]
I extracted the lambda to a named function for readability.
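Run against the longer list from mgilson's answer (the one that also contains "transferred dogcatcher"), this leaves that entry untouched; a quick check, assuming a and reducer from above are still in scope:
b = ["cat", "dog", "red dog is stupid", "good stuff \xa2xe3",
     "awesome Britney spears", "transferred dogcatcher"]
print([reduce(reducer, a, x).strip() for x in b])
# ['cat', 'dog', 'is stupid', 'good stuff', 'awesome', 'transferred dogcatcher']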
Upvotes: 2
Reputation: 310227
I think I'd use a regex here:
import re
a=["Britney spears", "red dog", "\xa2xe3"]
regex = re.compile('|'.join(re.escape(x) for x in a))
b=["cat","dog","red dog is stupid", "good stuff \xa2xe3", "awesome Britney spears"]
b = [regex.sub("",x) for x in b ]
print (b) #['cat', 'dog', ' is stupid', 'good stuff ', 'awesome ']
This way, the regular expression engine can optimize the test for the list of alternatives.
Here are a bunch of alternatives to show how different regexes behave.
import re
a = ["Britney spears", "red dog", "\xa2xe3"]
b = ["cat","dog",
"red dog is stupid",
"good stuff \xa2xe3",
"awesome Britney spears",
"transferred dogcatcher"]
#This version leaves whitespace and will match between words.
regex = re.compile('|'.join(re.escape(x) for x in a))
c = [regex.sub("",x) for x in b ]
print (c) #['cat', 'dog', ' is stupid', 'good stuff ', 'awesome ', 'transfercatcher']
#This version strips whitespace from either end
# of the returned string
regex = re.compile('|'.join(r'\s*{}\s*'.format(re.escape(x)) for x in a))
c = [regex.sub("",x) for x in b ]
print (c) #['cat', 'dog', 'is stupid', 'good stuff', 'awesome', 'transfercatcher']
#This version will only match at word boundaries,
# but you lose the match with \xa2xe3 since it isn't a word
regex = re.compile('|'.join(r'\s*\b{}\b\s*'.format(re.escape(x)) for x in a))
c = [regex.sub("",x) for x in b ]
print (c) #['cat', 'dog', 'is stupid', 'good stuff \xa2xe3', 'awesome', 'transferred dogcatcher']
#This version finally seems to get it right. It matches whitespace (or the start
# of the string) and then the "word" and then more whitespace (or the end of the
# string). It then replaces that match with nothing -- i.e. it removes the match
# from the string.
regex = re.compile('|'.join(r'(?:\s+|^)'+re.escape(x)+r'(?:\s+|$)' for x in a))
c = [regex.sub("",x) for x in b ]
print (c) #['cat', 'dog', 'is stupid', 'good stuff', 'awesome', 'transferred dogcatcher']
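Since the question mentions roughly 50k strings, here is a rough way to compare the compiled alternation against a plain per-needle replace loop (a sketch only, with a synthetic repeated list standing in for the real data; actual timings will vary):
import re
import timeit

a = ["Britney spears", "red dog", "\xa2xe3"]
big_b = ["awesome Britney spears", "red dog is stupid", "plain text"] * 17000  # ~50k strings

regex = re.compile('|'.join(r'(?:\s+|^)' + re.escape(x) + r'(?:\s+|$)' for x in a))

def with_regex():
    # single pass per string, all needles tried by the regex engine
    return [regex.sub("", x) for x in big_b]

def with_replace():
    # one str.replace per needle per string
    out = []
    for x in big_b:
        for needle in a:
            if needle in x:
                x = x.replace(needle, "")
        out.append(x.strip())
    return out

print(timeit.timeit(with_regex, number=3))
print(timeit.timeit(with_replace, number=3))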
Upvotes: 4
Reputation: 26160
Well, the simplest would be a straight list comprehension, and as long as a is small, it would even be a pretty efficient method.
b = [i for i in b if i not in a]
Upvotes: 1