Jonas Palačionis
Jonas Palačionis

Reputation: 4842

Counting occurrences of word in a string in Pandas

I am trying to count how many times a word is present in all of the strings of a Pandas Series I have a dataframe df which follows this logic:

word

hi    
hello
bye
goodbye

And a df_2 which looks like this ( scroll to the right to see another column )

sentence                                                                            metric_x

hello, what a wonderful day                                                         10
I did not said hello today                                                          15
what comes first, hi or hello                                                       25
the most used word is hi                                                            30
hi or hello, which is more formal                                                   50
he said goodbye, even though he never said hi or hello in the first place           5

I am trying to achieve the following in df: counting how many times a word is present and what is the metric_x sum of the values which matches the word.

word        count       metric_x_sum
        
hi          4           110
hello       5           105
bye         0           0
goodbye     1           5

I am using this:

df['count'] = df['word'].apply(lambda x: df_2['sentence'].str.count(x).sum())

The problem lies in the length of the dataframes, I have 70,000 unique words in df and 250,000 unique sentences in df_2, the line above is running for 15 minutes and I have no clue for how long it might run.

After letting it run for 15 minutes I got this error:

error: multiple repeat at position 2

Is there a smarter, faster way of achieving this?

Upvotes: 2

Views: 471

Answers (1)

jezrael
jezrael

Reputation: 862406

First split sentences for words and DataFrame.explode, remove trailing values , by Series.str.strip:

df2 = df_2.assign(word = df_2['sentence'].str.split()).explode('word')
df2['word'] = df2['word'].str.strip(',')
#print (df2)

Then DataFrame.merge with left join and aggregate GroupBy.count for exclude missing values with sum:

df3 = (df.merge(df2, on='word', how='left')
         .groupby('word')
         .agg(count=('metric_x', 'count'), metric_x_sum=('metric_x','sum')))
# print (df3)

Last add to original:

df = df.join(df3, on='word')
df['metric_x_sum'] = df['metric_x_sum'].astype(int)
print (df)
      word  count  metric_x_sum
0       hi      4           110
1    hello      5           105
2      bye      0             0
3  goodbye      1             5

Upvotes: 1

Related Questions