Reputation: 3327
So I have a large dataframe with many columns. Let us say the two main columns I am interested are messages and names. Each message will be something like a personal status and will be accompanied by the person's name. Let's say I have a word bank of emotion/feeling words which is fairly huge but a condensed version could look like ['sad', 'happy', 'relieved', 'annoyed', 'angry', 'ecstatic', 'proud', 'disappointed']. For each name (and names could be repeated since they could have multiple messages), I want to keep track of the emotion words they have used and their accompanying counts. For example (I have made this up since I have a really large dataset):
Message Name
I am really happy with my progress. Alice
I was annoyed by his inconsideration. John
I felt proud after seeing her performance. Lisa
I am ecstatic after hearing the good news. Alice
I felt disappointed by her dishonesty Lisa
If I wanted something that would essentially tell me the following (I don't necessarily care about the format as long as it's clean): {Alice: happy:1, ecstatic: 1; John: annoyed:1; Lisa: proud:1, disappointed: 1}.
What is the best/easiest way of doing this? Is a dictionary of dictionaries the best way to do this? Secondly (this is sort of an additional related question), after computing above, what if I wanted to find the most frequently used emotion word for each name. In this example everything is pretty much tied at 1, but imagine the counts were not the same.
Upvotes: 2
Views: 80
Reputation: 2561
If you are looking for a very easy way of accomplishing what you are looking for, I would suggest using the pandas groupby functionality in combination with the collections utility.
df = pd.read_csv('messages.csv')
df['Emotion Words'] = df.Message.apply(extract_emotion_words)
print(df)
Output:
Message Name Emotion Words
0 I am really happy with my progress. Alice [happy]
1 I am really happy with johns progress. Alice [happy]
2 I was annoyed by his inconsideration. John [annoyed]
3 I felt proud after seeing her performance. Lisa [proud]
4 I am ecstatic after hearing the good news. Alice [ecstatic]
5 I felt disappointed by her dishonesty Lisa [disappointed]
from collections import Counter
df.groupby('Name')['Emotion Words'].sum().apply(Counter)
Output:
Name
Alice {'happy': 2, 'ecstatic': 1}
John {'annoyed': 1}
Lisa {'disappointed': 1, 'proud': 1}
dtype: object
Since you also indicated that you want to get the most common emotion word, you can use the Counter.most_common(n)
function where n
is the number of most common words to extract (in your case 1):
df.groupby('Name')['Emotion Words'].sum().apply(
lambda words: Counter(words).most_common(1)
)
Output:
Name
Alice [(happy, 2)]
John [(annoyed, 1)]
Lisa [(disappointed, 1)]
dtype: object
For this little example, I used the following function to extract the emotion words. There sure might be better ways of doing it (perhaps you are already using the nltk, if not I would suggest you look into it).
import re
emotionWords = set(['sad', 'happy', 'relieved', 'annoyed', 'angry', 'ecstatic', 'proud', 'disappointed'])
def extract_emotion_words(message):
tokenized = re.split(r'\W+', message.lower())
return list(set(tokenized) & emotionWords)
Upvotes: 3