Reputation: 1185
I have some twitter data and I split the text into those with happy emoticons and sad emoticons elegantly and pythonically like so:
happy_set = [":)",":-)","=)",":D",":-D","=D"]
sad_set = [":(",":-(","=("]
happy = [tweet.split() for tweet in data for face in happy_set if face in tweet]
sad = [tweet.split() for tweet in data for face in sad_set if face in tweet]
This works; however, a single tweet could contain emoticons from both happy_set and sad_set. What is the Pythonic way to ensure that the happy list only contains tweets whose emoticons all come from happy_set, and vice versa?
Upvotes: 1
Views: 147
Reputation: 176830
You could try using sets, specifically set.isdisjoint. Check to see if the set of tokens in a happy tweet is disjoint from sad_set. If so, it definitely belongs in happy:
happy_set = set([":)",":-)","=)",":D",":-D","=D"])
sad_set = set([":(",":-(","=("])
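# happy is your existing list of potentially happy tweets, already split into
# tokens as in the question. To remove any tweets with sad tokens...
# (isdisjoint accepts any iterable, so the token list can be passed directly.)
happy = [tokens for tokens in happy if sad_set.isdisjoint(tokens)]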
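As a rough end-to-end sketch of the same idea, the two lists can be built in a single pass instead of filtering afterwards. The classify function and the sample data below are hypothetical names and values for illustration, not part of the original question.

```python
happy_set = {":)", ":-)", "=)", ":D", ":-D", "=D"}
sad_set = {":(", ":-(", "=("}

# Hypothetical sample data, used only for illustration.
data = ["Hi, I am sad :( but don't worry =D", "Happy day :-)", "Boooh :-("]


def classify(tweets):
    """Split tweets into purely happy and purely sad token lists."""
    happy, sad = [], []
    for tweet in tweets:
        tokens = tweet.split()
        # A tweet "has" a mood if its tokens overlap the emoticon set.
        has_happy = not happy_set.isdisjoint(tokens)
        has_sad = not sad_set.isdisjoint(tokens)
        if has_happy and not has_sad:
            happy.append(tokens)
        elif has_sad and not has_happy:
            sad.append(tokens)
        # Tweets with both moods, or neither, are dropped.
    return happy, sad


happy, sad = classify(data)
```

Note that this matches whole whitespace-separated tokens, so an emoticon glued to a word (for example "great:)") would not be detected, unlike the substring test used in the question.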
Upvotes: 3
Reputation: 2944
I would use lambdas:
>>> is_happy = lambda tweet: any(map(lambda x: x in happy_set, tweet.split()))
>>> is_sad = lambda tweet: any(map(lambda x: x in sad_set, tweet.split()))
>>> data = ["Hi, I am sad :( but don't worry =D", "Happy day :-)", "Boooh :-("]
>>> filter(lambda tweet: is_happy(tweet) and not is_sad(tweet), data)
['Happy day :-)']
>>> filter(lambda tweet: is_sad(tweet) and not is_happy(tweet), data)
['Boooh :-(']
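That will avoid creating intermediary copies of data. And if data is really big, you can replace filter with ifilter from the itertools module to get an iterator instead of a list. (This applies to Python 2; in Python 3 the built-in filter already returns an iterator and ifilter no longer exists.)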
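For readers on Python 3, a minimal sketch of the same approach might look like the following. The helper names only_happy and only_sad are hypothetical; the sample data is the one from the session above.

```python
happy_set = {":)", ":-)", "=)", ":D", ":-D", "=D"}
sad_set = {":(", ":-(", "=("}


def is_happy(tweet):
    # True if any whitespace-separated token is a happy emoticon.
    return any(token in happy_set for token in tweet.split())


def is_sad(tweet):
    # True if any whitespace-separated token is a sad emoticon.
    return any(token in sad_set for token in tweet.split())


def only_happy(tweets):
    # In Python 3, filter returns a lazy iterator, not a list.
    return filter(lambda t: is_happy(t) and not is_sad(t), tweets)


def only_sad(tweets):
    return filter(lambda t: is_sad(t) and not is_happy(t), tweets)


data = ["Hi, I am sad :( but don't worry =D", "Happy day :-)", "Boooh :-("]
print(list(only_happy(data)))
print(list(only_sad(data)))
```

Wrapping the result in list() is only needed when an actual list is required; iterating over the filter object directly keeps memory use low for large inputs.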
Upvotes: 1
Reputation: 52000
Is this what you are looking for?
happy_set = set([":)",":-)","=)",":D",":-D","=D"])
sad_set = set([":(",":-(","=("])
happy_maybe_sad = [tweet.split() for tweet in data for face in happy_set if face in tweet]
sad_maybe_happy = [tweet.split() for tweet in data for face in sad_set if face in tweet]
# Keep only the items that do not also appear in the other list.
happy = [item for item in happy_maybe_sad if item not in sad_maybe_happy]
sad = [item for item in sad_maybe_happy if item not in happy_maybe_sad]
For happy and sad, I stuck with lists because the order of the items may be relevant. If it is not, using set() would probably perform better. In addition, sets already provide the basic set operations (union, intersection, etc.).
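If order does not matter, a set-based variant could look roughly like this. It is only a sketch: it stores the raw tweet strings rather than token lists (lists are not hashable, so they cannot be set members), and the sample data is hypothetical.

```python
happy_set = {":)", ":-)", "=)", ":D", ":-D", "=D"}
sad_set = {":(", ":-(", "=("}

# Hypothetical sample data, used only for illustration.
data = ["Hi, I am sad :( but don't worry =D", "Happy day :-)", "Boooh :-("]

# Sets of raw tweet strings; substring test as in the question.
happy_maybe_sad = {t for t in data if any(face in t for face in happy_set)}
sad_maybe_happy = {t for t in data if any(face in t for face in sad_set)}

# Set difference removes the tweets that appear on both sides.
happy = happy_maybe_sad - sad_maybe_happy
sad = sad_maybe_happy - happy_maybe_sad

# Split into tokens afterwards if token lists are needed.
happy_tokens = [t.split() for t in happy]
sad_tokens = [t.split() for t in sad]
```

Using sets also collapses duplicate tweets, which may or may not be desirable depending on the data.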
Upvotes: 0