Reputation: 136
I have a column of a dataframe df['psych_prob'] and I have a dictionary containing different categories and their corresponding list of words. I have to count how many times the words belonging to each category in a column of Pandas dataframe appears.
The code below works for me but my real data set is more than 100k rows and my real categories are more than 40 and words in each category are more than 500. It takes me more than 1 hour to run the code. I am trying to optimize the speed of the following code.
dummy_dict={
'psych_prob':
['he would be happy about it, but i am sad it does not make sense to her',
'how i can be happy i am anxious all the time my fathers death is here']
}
df=pd.DataFrame(dummy_dict)
# dictionary containing categories and their list of words
category_dict={'pronouns':['he', "he'd", "he's", 'her', 'hers'],
'neg_emo':['sad','anxious','angry'],
'pos_emo':['happy','excited']}
for category in category_dict:
#join the list of words by space and pipe
category_joined=' |'.join(e for e in category_dict[category])
#count how many times the list of words belonging to the category appears in the dataframe
category_count=df.psych_prob.str.count(category_joined).sum()
print('Category:',category)
print('Words to search:',category_joined)
print('Total words of this category in dataframe:', category_count)
print('\n')
Upvotes: 0
Views: 74
Reputation: 863146
EDIT: Here is solution with word boundaries and joind values of column to one long sentence:
import re
#join outside loop for improve performance
str_vals = ' '.join(df['psych_prob'])
for category, vals in category_dict.items():
pat = '|'.join(r"\b{}\b".format(x) for x in vals)
category_count = len(re.findall(pat, str_vals))
print('Category:',category)
print('Total words of this category in dataframe:', category_count)
print('\n')
I think you can use Aho-Corasick frequency counter for improve performance with join
by column for one long string:
#https://stackoverflow.com/a/51604049/2901002
def ac_frequency(needles, haystack):
frequencies = [0] * len(needles)
# Make a searcher
searcher = ahocorasick.Automaton()
for i, needle in enumerate(needles):
searcher.add_word(needle, i)
searcher.make_automaton()
# Add up all frequencies
for _, i in searcher.iter(haystack):
frequencies[i] += 1
return frequencies
#join outside loop for improve performance
str_vals = ' '.join(df['psych_prob'])
for category, vals in category_dict.items():
#join the list of words by space and pipe
category_count = ac_frequency(vals, str_vals)
print('Category:',category)
print('Total words of this category in dataframe:', category_count)
print('\n')
Upvotes: 2