Reputation: 3811
I have a dataframe train with a column tweet_content. There is a column sentiment which tells the overall sentiment of each tweet. There are a lot of words which are common across tweets of neutral, positive and negative sentiment. I want to find the words which are unique to each specific sentiment.
train
tweet_content sentiment
[PM, you, rock, man] Positive
[PM, you, are, a, total, idiot, man] Negative
[PM, I, have, no, opinion, about, you, dear] Neutral
and so on. There are 30,000 rows.
P.S. Note that each tweet (row) is a list of words in the column tweet_content.
Expected output for the above tweets (unique_positive, unique_negative, etc. are computed over all the tweets in the df. There are 30,000 rows, so unique_positive is the list of words which occur only in positive tweets across all 30,000 rows combined. Here I have just taken 3 tweets as a random example):
unique_positive = [rock] # you and PM occur in negative and neutral tweets; man occurs in the negative tweet
unique_negative = [are, a, total, idiot] # you and PM occur in positive and neutral tweets; man occurs in the positive tweet
unique_neutral = [I, have, no, opinion, about, dear] # you and PM occur in positive and negative tweets
where
raw_text = [word for word_list in train['tweet_content'] for word in word_list] # list of all words
unique_Positive = words_unique('Positive', 20, raw_text) # find the 20 most frequent words which occur only in positive tweets
Problem: The function below runs correctly and finds the unique words for positive, neutral and negative sentiments, but it takes 30 minutes to run. Is there a way to optimise this function so that it runs faster?
Function to find out the unique words for each sentiment:
from collections import Counter
import pandas as pd

def words_unique(sentiment, numwords, raw_words):
    '''
    Input:
        sentiment - sentiment category (e.g. 'Neutral')
        numwords - how many words to include in the final result
        raw_words - list of all words across all tweets
    Output:
        dataframe with the numwords most frequent words that occur only in
        the given sentiment, in descending order of their counts.
    '''
    # every word that appears in any tweet of a *different* sentiment
    allother = []
    for item in train[train.sentiment != sentiment]['tweet_content']:
        for word in item:
            allother.append(word)
    allother = list(set(allother))

    # words that never occur in another sentiment
    # (the bottleneck: "x not in allother" is a linear scan over a list)
    specificnonly = [x for x in raw_words if x not in allother]

    # count word frequencies within the target sentiment
    mycounter = Counter()
    for item in train[train.sentiment == sentiment]['tweet_content']:
        for word in item:
            mycounter[word] += 1

    # drop any counted word that also occurs in another sentiment
    keep = list(specificnonly)
    for word in list(mycounter):
        if word not in keep:
            del mycounter[word]

    Unique_words = pd.DataFrame(mycounter.most_common(numwords), columns=['words', 'count'])
    return Unique_words
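For reference, most of the 30 minutes goes into specificnonly: allother is converted back to a list, so every "x not in allother" test is a linear scan, which is roughly quadratic over 30,000 rows (the "word not in keep" test has the same problem). A minimal sketch of the same logic with set-based membership tests (words_unique_fast is a hypothetical name; it assumes the same train dataframe and column names as above):

from collections import Counter
import pandas as pd

def words_unique_fast(sentiment, numwords, raw_words=None):
    # set of every word appearing in any other sentiment: O(1) membership tests
    allother = set(
        word
        for item in train[train.sentiment != sentiment]['tweet_content']
        for word in item
    )
    # count words of the target sentiment, skipping any word seen elsewhere;
    # raw_words is unused here because counting is already restricted to the
    # target sentiment's tweets
    mycounter = Counter(
        word
        for item in train[train.sentiment == sentiment]['tweet_content']
        for word in item
        if word not in allother
    )
    return pd.DataFrame(mycounter.most_common(numwords), columns=['words', 'count'])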
Upvotes: 2
Views: 264
Reputation: 1441
This should work (add the bells and whistles, like filtering for numwords, as you require them):
Edit (added explainer comments):
import pandas as pd

# toy version of the question's data (with one extra positive tweet)
df = pd.DataFrame([['Positive', 'Positive', 'Negative', 'Neutral'],
                   [['PM', 'you', 'rock', 'man'],
                    ['PM'],
                    ['PM', 'you', 'are', 'a', 'total', 'idiot', 'man'],
                    ['PM', 'I', 'have', 'no', 'opinion', 'about', 'you', 'dear']]]).T
df.columns = ['sentiment', 'tweet']
# join the list back to a sentence
df['tweet'] = df['tweet'].apply(lambda x: " ".join(x))
# join all the sentences in a group (i.e. sentiment) and then get unique words
_df = df.groupby(['sentiment']).agg({'tweet':lambda x: set(" ".join(x).split(" "))})['tweet']
# groupby gives one row per sentiment; index by label rather than position
neg = _df['Negative']; neu = _df['Neutral']; pos = _df['Positive']
# basically, A *minus* (B *union* C)
uniq_pos = pos - (neg.union(neu))
uniq_neu = neu - (pos.union(neg))
uniq_neg = neg - (pos.union(neu))
uniq_pos, uniq_neu, uniq_neg
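If you also want the numwords most frequent of those unique words, as in the original function, one way is to count frequencies with Counter and filter by the sets above. A sketch (top_unique is a hypothetical helper; it assumes the df built above, where 'tweet' has already been joined back into a sentence):

from collections import Counter

def top_unique(sentiment, uniq, n):
    # split the joined sentences back into words, count only the unique ones
    words = " ".join(df.loc[df.sentiment == sentiment, 'tweet']).split(" ")
    return Counter(w for w in words if w in uniq).most_common(n)

top_unique('Positive', uniq_pos, 20)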
Upvotes: 2