user12907213

Word frequency with stemming

I have a question about how to sum the frequencies of words that I consider to have a similar meaning, so that they are counted as the same word.

For example, I have this dataset:

    Word        Frequency
0   game        52055
1   laura       24953
2   luke        21133
3   story       20739
4   dog         17054
5   like        12792
7   character   8845
9   play        8420
11  characters  8081
12  people      7933
16  good        6496
18  10          6309
19  gameplay    6195
22  revenge     5922
25  bad         5331
26  end         5027
27  feel        4833
28  killed      4779
31  kill        4545
33  graphics    4372
34  time        4272
35  cat         4244
44  great       3466
45  ending      3379
...
50  love        3059
51  never       2965
52  new         2963
53  killing     2955

This dataset has two columns: one with words and one with their frequency in the document. I need to treat words that share a root, such as kill, killed and killing, as the same word.

I think this could easily be done with a Porter stemmer (PorterStemmer). However, I would also need to sum the frequencies of the words that are merged.

So, for example,

28  killed  4779
31  kill    4545
53  killing 2955

should be

31 kill 12279

Unfortunately, I could not apply stemming earlier, because I received the dataset in the form shown above. Could you please give me some advice on how to get this sum?

Upvotes: 0

Views: 598

Answers (1)

Georgina Skibinski

Reputation: 13397

You can use nltk (df being the input dataframe you've shared):

from nltk.stem import PorterStemmer

ps = PorterStemmer()
# Reduce each word to its Porter stem, e.g. "killed" and "killing" -> "kill"
df["Stem"] = df["Word"].apply(ps.stem)
# Sum the frequencies of all words that share a stem
res = df.groupby("Stem")["Frequency"].sum()

Outputs (for the piece you shared):

Stem
10           6309
bad          5331
cat          4244
charact     16926
dog         17054
end          8406
feel         4833
game        52055
gameplay     6195
good         6496
graphic      4372
great        3466
kill        12279
laura       24953
like        12792
love         3059
luke        21133
never        2965
new          2963
peopl        7933
play         8420
reveng       5922
stori       20739
time         4272
Name: Frequency, dtype: int64
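
Note that Porter stems such as charact, stori and peopl are not real words. If you would rather label each group with a word from your own data, as in the "31 kill 12279" row you describe, here is a minimal sketch. It assumes that the shortest original word in each group is an acceptable label, which is only a heuristic and not something the stemmer provides. The sample dataframe is a small excerpt of your data, built inline so the snippet runs on its own.

```python
import pandas as pd
from nltk.stem import PorterStemmer

# Small excerpt of the data from the question (assumed column names: Word, Frequency).
df = pd.DataFrame({
    "Word": ["killed", "kill", "killing", "character", "characters", "story"],
    "Frequency": [4779, 4545, 2955, 8845, 8081, 20739],
})

ps = PorterStemmer()
# Words that reduce to the same Porter stem end up in the same group.
df["Stem"] = df["Word"].apply(ps.stem)

res = (
    df.groupby("Stem")
      .agg(
          # Heuristic label: the shortest original word in the group.
          Word=("Word", lambda s: min(s, key=len)),
          # Total frequency of all words sharing the stem.
          Frequency=("Frequency", "sum"),
      )
      .sort_values("Frequency", ascending=False)
      .reset_index(drop=True)
)
print(res)
```

For this excerpt the three rows are story (20739), character (16926) and kill (12279). If you prefer the most frequent spelling as the label instead, you could sort by Frequency first and take the first word of each group.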

Upvotes: 3
