Reputation: 873
I have a very large data frame full of song lyrics. I've tokenized the lyrics column, so each row holds a list of words, e.g. ["You", "say", "goodbye", "and", "I", "say", "hello"] and so on. I wrote a function to calculate a sentiment score using a list of positive words and a list of negative words. I then need to apply this function to the lyrics column to calculate positive sentiment, negative sentiment, and net sentiment and make them new columns.
I attempted to split my data frame into a list of chunks of 1000 rows and then loop through them to apply the function, but it is still taking a fairly long time. I'm wondering if there is a more efficient way I should be doing this, or if this is as good as it gets and I just have to wait it out.
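For reference, positiv and negativ in the function below are plain word lists; illustrative placeholder values (not the real lexicons) would look like this:
positiv = ['hello', 'love', 'happy']
negativ = ['goodbye', 'sad', 'lonely']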
def sentiment_scorer(row):
    pos = neg = 0
    for item in row['lyrics']:
        # count positive words
        if item in positiv:
            pos += 1
        # count negative words
        elif item in negativ:
            neg += 1
        # ignore words that are neither negative nor positive
        else:
            pass
    # set sentiment to 0 if pos is 0
    if pos < 1:
        pos_sent = 0
    else:
        pos_sent = pos / len(row['lyrics'])
    # set sentiment to 0 if neg is 0
    if neg < 1:
        neg_sent = 0
    else:
        neg_sent = neg / len(row['lyrics'])
    # return positive and negative sentiment to make new columns
    return pos_sent, neg_sent
# chunk data frames
n = 1000
list_df = [lyrics_cleaned_df[i:i+n] for i in range(0, lyrics_cleaned_df.shape[0], n)]
for lr in range(len(list_df)):
    # credit for method: toto_tico on Stack Overflow https://stackoverflow.com/a/46197147
    list_df[lr]['positive_sentiment'], list_df[lr]['negative_sentiment'] = zip(*list_df[lr].apply(sentiment_scorer, axis=1))
    list_df[lr]['net_sentiment'] = list_df[lr]['positive_sentiment'] - list_df[lr]['negative_sentiment']
ETA: sample data frame
data = [['ego-remix', 2009, 'beyonce-knowles', 'Pop', ['oh', 'baby', 'how']],
['then-tell-me', 2009, 'beyonce-knowles', 'Pop', ['playin', 'everything', 'so']],
['honesty', 2009, 'beyonce-knowles', 'Pop', ['if', 'you', 'search']]]
df = pd.DataFrame(data, columns = ['song', 'year', 'artist', 'genre', 'lyrics'])
Upvotes: 1
Views: 390
Reputation: 1811
If I understand the problem correctly, then using your example (I added an extra word to create uneven-length lists) you can create a separate dataframe, lyrics, converting the words from your lyrics lists into separate columns.
data = [['ego-remix', 2009, 'beyonce-knowles', 'Pop', ['oh', 'baby', 'how', "d"]],
['then-tell-me', 2009, 'beyonce-knowles', 'Pop', ['playin', 'everything', 'so']],
['honesty', 2009, 'beyonce-knowles', 'Pop', ['if', 'you', 'search']]]
df = pd.DataFrame(data, columns = ['song', 'year', 'artist', 'genre', 'lyrics'])
Then define lyrics:
lyrics = pd.DataFrame(df.lyrics.values.tolist())
# 0 1 2 3
# 0 oh baby how d
# 1 playin everything so None # Null rows need to be accounted for
# 2 if you search None # Null rows need to be accounted for
Then, if you have two lists with your positive and negative sentiment words, like below, you can calculate the sentiment per row (lyric) using the mean() method.
# positive and negative sentiment words
pos = ['baby', 'you']
neg = ['if', 'so']
# When converting the lyrics lists to a new dataframe, it will contain null values
# wherever the lists are not all the same length. The row means therefore need to be
# rescaled by the proportion of non-null entries in each row.
null_rows = lyrics.notnull().mean(1)
# Calculate the proportion of positive and negative words, accounting for null values
pos_sent = lyrics.isin(pos).mean(1) / null_rows
neg_sent = lyrics.isin(neg).mean(1) / null_rows
# pos_sent
# 0 0.250000
# 1 0.000000
# 2 0.333333
# neg_sent
# 0 0.000000
# 1 0.333333
# 2 0.333333
If I understand your problem fully, then you should be able to use df['pos'] = pos_sent and df['neg'] = neg_sent. I imagine there may be some issues, so let me know if this is in the right ball park.
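A minimal sketch of that last assignment step, assuming the positive_sentiment/negative_sentiment/net_sentiment column names from your question and the pos_sent/neg_sent Series computed above:
# assign the vectorised results back onto the original dataframe
df['positive_sentiment'] = pos_sent
df['negative_sentiment'] = neg_sent
# net sentiment as defined in the question: positive minus negative
df['net_sentiment'] = df['positive_sentiment'] - df['negative_sentiment']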
Upvotes: 1