Reputation: 873
I have a very large data frame full of song lyrics. I've tokenized the lyrics column, so each row holds a list of words, e.g. ["You", "say", "goodbye", "and", "I", "say", "hello"] and so on. I wrote a function to calculate a sentiment score using a list of positive words and a list of negative words. I then need to apply this function to the lyrics column to calculate positive sentiment, negative sentiment, and net sentiment and make them new columns.
I attempted to split my data frame into a list of chunks of 1000 rows and then loop through them to apply the function, but it is still taking a fairly long time. I'm wondering if there is a more efficient way I should be doing this, or if this is as good as it gets and I just have to wait it out.
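For reference, positiv and negativ in the function below are plain word lists; illustrative placeholder values (not the real lexicons) would look like this:
positiv = ['hello', 'love', 'happy']
negativ = ['goodbye', 'sad', 'lonely']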
def sentiment_scorer(row):
    pos = neg = 0
    for item in row['lyrics']:
        # count positive words
        if item in positiv:
            pos += 1
        # count negative words
        elif item in negativ:
            neg += 1
        # ignore words that are neither negative nor positive
        else:
            pass
    # set sentiment to 0 if pos is 0
    if pos < 1:
        pos_sent = 0
    else:
        pos_sent = pos / len(row['lyrics'])
    # set sentiment to 0 if neg is 0
    if neg < 1:
        neg_sent = 0
    else:
        neg_sent = neg / len(row['lyrics'])
    # return positive and negative sentiment to make new columns
    return pos_sent, neg_sent
# chunk data frames
n = 1000
list_df = [lyrics_cleaned_df[i:i+n] for i in range(0, lyrics_cleaned_df.shape[0], n)]
for lr in range(len(list_df)):
    # credit for method: toto_tico on Stack Overflow https://stackoverflow.com/a/46197147
    list_df[lr]['positive_sentiment'], list_df[lr]['negative_sentiment'] = zip(*list_df[lr].apply(sentiment_scorer, axis=1))
    list_df[lr]['net_sentiment'] = list_df[lr]['positive_sentiment'] - list_df[lr]['negative_sentiment']
ETA: sample data frame
data = [['ego-remix', 2009, 'beyonce-knowles', 'Pop', ['oh', 'baby', 'how']],
['then-tell-me', 2009, 'beyonce-knowles', 'Pop', ['playin', 'everything', 'so']],
['honesty', 2009, 'beyonce-knowles', 'Pop', ['if', 'you', 'search']]]
df = pd.DataFrame(data, columns = ['song', 'year', 'artist', 'genre', 'lyrics'])
Upvotes: 1
Views: 390
Reputation: 1811
If I understand the problem correctly, then using your example (I added an extra word to create uneven-length lists) you can create a separate dataframe, lyrics, converting the words from your lyrics lists into separate columns.
data = [['ego-remix', 2009, 'beyonce-knowles', 'Pop', ['oh', 'baby', 'how', "d"]],
['then-tell-me', 2009, 'beyonce-knowles', 'Pop', ['playin', 'everything', 'so']],
['honesty', 2009, 'beyonce-knowles', 'Pop', ['if', 'you', 'search']]]
df = pd.DataFrame(data, columns = ['song', 'year', 'artist', 'genre', 'lyrics'])
Then define lyrics:
lyrics = pd.DataFrame(df.lyrics.values.tolist())
# 0 1 2 3
# 0 oh baby how d
# 1 playin everything so None # Null rows need to be accounted for
# 2 if you search None # Null rows need to be accounted for
Then, if you have two lists with your positive and negative sentiment words, like below, you can calculate the sentiment per row (lyric) using the mean() method.
# positive and negative sentiment words
pos = ['baby', 'you']
neg = ['if', 'so']
# When converting the lyrics lists to a new dataframe, it will contain null values
# wherever the lists are not all the same length. The row means therefore need to be
# rescaled by the proportion of non-null entries in each row.
null_rows = lyrics.notnull().mean(1)
# Calculate the proportion of positive and negative words, accounting for null values
pos_sent = lyrics.isin(pos).mean(1) / null_rows
neg_sent = lyrics.isin(neg).mean(1) / null_rows
# pos_sent
# 0 0.250000
# 1 0.000000
# 2 0.333333
# neg_sent
# 0 0.000000
# 1 0.333333
# 2 0.333333
If I understand your problem fully, then you should be able to use df['pos'] = pos_sent and df['neg'] = neg_sent. I imagine there may be some issues, so let me know if this is in the right ball park.
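A minimal sketch of that last assignment step, assuming the positive_sentiment/negative_sentiment/net_sentiment column names from your question and the pos_sent/neg_sent Series computed above:
# assign the vectorised results back onto the original dataframe
df['positive_sentiment'] = pos_sent
df['negative_sentiment'] = neg_sent
# net sentiment as defined in the question: positive minus negative
df['net_sentiment'] = df['positive_sentiment'] - df['negative_sentiment']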
Upvotes: 1