noob

Reputation: 3811

Tweet analysis: get unique positive, negative and neutral words (natural language processing, optimised solution)

I have a dataframe train with a column tweet_content and a column sentiment that gives the overall sentiment of each tweet. A lot of words are shared between tweets of neutral, positive and negative sentiment; I want to find the words that are unique to each specific sentiment.

train

tweet_content                                sentiment 
[PM, you, rock, man]                         Positive
[PM, you, are, a, total, idiot, man]         Negative
[PM, I, have, no, opinion, about, you, dear] Neutral

...and so on; there are 30,000 rows.

P.S. Note that each tweet (row) is a list of words in the tweet_content column.

Expected output for the above tweets (unique_positive, unique_negative, etc. are computed over all the tweets in the df, i.e. all 30,000 rows combined, not per tweet; here I have just taken 3 tweets as a random example):

unique_positive = [rock] # you and PM occur in negative and neutral tweets; man occurs in the negative tweet
unique_negative = [are, a, total, idiot] # you and PM occur in positive and neutral tweets; man occurs in the positive tweet
unique_neutral = [I, have, no, opinion, about, dear] # you and PM occur in positive and negative tweets

where

raw_text = [word for word_list in train['tweet_content'] for word in word_list] # list of all words
unique_Positive = words_unique('Positive', 20, raw_text) # find the 20 unique words that occur only in positive-sentiment tweets

Problem: the function below runs correctly and finds the unique words for the positive, neutral and negative sentiments, but it takes 30 minutes to run. Is there a way to optimise this function and make it faster?

Function to find the unique words for each sentiment:

from collections import Counter
import pandas as pd

def words_unique(sentiment, numwords, raw_words):
    '''
    Input:
        sentiment - sentiment category (e.g. 'Neutral');
        numwords - how many of the top unique words to include in the final result;
        raw_words - list of all words across all tweets.
    Output:
        dataframe with the numwords most frequent words that occur only in the
        given sentiment (in descending order of their counts).
    '''
    # every word that appears in tweets of any *other* sentiment
    allother = []
    for item in train[train.sentiment != sentiment]['tweet_content']:
        for word in item:
            allother.append(word)
    allother = list(set(allother))

    # words that never appear in the other sentiments
    specificnonly = [x for x in raw_words if x not in allother]

    # count word occurrences within the target sentiment
    mycounter = Counter()
    for item in train[train.sentiment == sentiment]['tweet_content']:
        for word in item:
            mycounter[word] += 1

    # drop any counted word that is not unique to this sentiment
    keep = list(specificnonly)
    for word in list(mycounter):
        if word not in keep:
            del mycounter[word]

    Unique_words = pd.DataFrame(mycounter.most_common(numwords), columns=['words', 'count'])

    return Unique_words
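
Most of those 30 minutes almost certainly go into the "x not in allother" membership test: allother is a plain list, so every word in raw_words triggers a full scan of it. Below is a minimal sketch of the same logic using set lookups, assuming the same global train dataframe as above; the name words_unique_fast is only illustrative, and raw_words drops out because counting within the target sentiment already covers every candidate word.

from collections import Counter
import pandas as pd

def words_unique_fast(sentiment, numwords):
    # all words that appear in tweets of any *other* sentiment, kept as a set
    # so each membership check below is O(1) instead of a full list scan
    allother = {word
                for item in train[train.sentiment != sentiment]['tweet_content']
                for word in item}

    # count only the words of the target sentiment that never occur elsewhere
    mycounter = Counter(word
                        for item in train[train.sentiment == sentiment]['tweet_content']
                        for word in item
                        if word not in allother)

    return pd.DataFrame(mycounter.most_common(numwords), columns=['words', 'count'])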

Upvotes: 2

Views: 264

Answers (1)

Partha Mandal

Reputation: 1441

This should work (add the bells and whistles, like filtering for numwords, as you need them):

Edit (added explainer comments):

import pandas as pd

df = pd.DataFrame([['Positive', 'Positive', 'Negative', 'Neutral'],
                   [['PM', 'you', 'rock', 'man'],
                    ['PM'],
                    ['PM', 'you', 'are', 'a', 'total', 'idiot', 'man'],
                    ['PM', 'I', 'have', 'no', 'opinion', 'about', 'you', 'dear']]]).T
df.columns = ['sentiment', 'tweet']
# join each list of words back into a sentence
df['tweet'] = df['tweet'].apply(lambda x: " ".join(x))

# join all the sentences in a group (i.e. sentiment) and then get the unique words
_df = df.groupby(['sentiment']).agg({'tweet': lambda x: set(" ".join(x).split(" "))})['tweet']
# groupby gives one row (a set of words) per sentiment; index by label rather than position
neg = _df['Negative']; neu = _df['Neutral']; pos = _df['Positive']

# basically, A *minus* (B *union* C)
uniq_pos = pos - (neg.union(neu))
uniq_neu = neu - (pos.union(neg))
uniq_neg = neg - (pos.union(neu))

uniq_pos, uniq_neu, uniq_neg
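
For reference, on the four sample rows above those three sets work out to the following (word order inside each set is arbitrary):

uniq_pos == {'rock'}
uniq_neu == {'I', 'have', 'no', 'opinion', 'about', 'dear'}
uniq_neg == {'are', 'a', 'total', 'idiot'}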

Upvotes: 2
