Reputation: 4842
I am trying to count how many times a word is present in all of the strings of a Pandas Series
I have a dataframe df
which follows this logic:
word
hi
hello
bye
goodbye
And a df_2
which looks like this ( scroll to the right to see another column )
sentence metric_x
hello, what a wonderful day 10
I did not said hello today 15
what comes first, hi or hello 25
the most used word is hi 30
hi or hello, which is more formal 50
he said goodbye, even though he never said hi or hello in the first place 5
I am trying to achieve the following in df
: counting how many times a word
is present and what is the metric_x
sum of the values which matches the word
.
word count metric_x_sum
hi 4 110
hello 5 105
bye 0 0
goodbye 1 5
I am using this:
df['count'] = df['word'].apply(lambda x: df_2['sentence'].str.count(x).sum())
The problem lies in the length of the dataframes, I have 70,000
unique words in df
and 250,000
unique sentences in df_2
, the line above is running for 15 minutes and I have no clue for how long it might run.
After letting it run for 15 minutes I got this error:
error: multiple repeat at position 2
Is there a smarter, faster way of achieving this?
Upvotes: 2
Views: 471
Reputation: 862406
First split sentences for words and DataFrame.explode
, remove trailing values ,
by Series.str.strip
:
df2 = df_2.assign(word = df_2['sentence'].str.split()).explode('word')
df2['word'] = df2['word'].str.strip(',')
#print (df2)
Then DataFrame.merge
with left join and aggregate GroupBy.count
for exclude missing values with sum
:
df3 = (df.merge(df2, on='word', how='left')
.groupby('word')
.agg(count=('metric_x', 'count'), metric_x_sum=('metric_x','sum')))
# print (df3)
Last add to original:
df = df.join(df3, on='word')
df['metric_x_sum'] = df['metric_x_sum'].astype(int)
print (df)
word count metric_x_sum
0 hi 4 110
1 hello 5 105
2 bye 0 0
3 goodbye 1 5
Upvotes: 1