user12907213


Counting tokens in a document

I need to calculate the frequency of every token in the training data and make a list of the tokens whose frequency is at least N. To split my dataset into train and test sets I did the following:

import numpy as np
from sklearn.model_selection import train_test_split

# vectorizer is assumed to be a scikit-learn text vectorizer, e.g. CountVectorizer
X = vectorizer.fit_transform(df['Text'].replace(np.nan, ""))
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, stratify=y)

If the Text column contains sentences, for example

Text
Show some code
Describe what you've tried
Have a non-programming question?
More helpful links 

then to extract all the tokens I did the following:

import pandas as pd
from nltk.tokenize import word_tokenize

X_train['tokenized_text'] = X_train.Text.apply(lambda row: word_tokenize(row))

This gives me tokens per row, not globally. I need the full list of tokens, counted across all the rows, in order to make a list of the tokens whose frequency is at least N. My difficulty is in counting the frequency of tokens across the whole column.

Could you please tell me how to count these tokens?

UPDATE:

The following code works fine:

df.Text.str.split(expand=True).stack().value_counts()

However, I don't know how to extract all the words/tokens having a count greater than 15, for example.

Upvotes: 5

Views: 1124

Answers (2)

Quang Hoang

Reputation: 150785

Since you say the following works fine:

s = df.Text.str.split(expand=True).stack().value_counts()

then you can do

s[s >= 15].index

to get the tokens with at least 15 counts.
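
For reference, here is a minimal, self-contained sketch of that approach; the toy DataFrame and the threshold N are just placeholders standing in for your data:

import pandas as pd

# Toy stand-in for df['Text']
df = pd.DataFrame({"Text": ["show some code", "show what you tried", "show more links"]})

# Count every whitespace-separated token across the whole column
s = df.Text.str.split(expand=True).stack().value_counts()

# Keep only the tokens appearing at least N times
N = 2
frequent_tokens = s[s >= N].index.tolist()
print(frequent_tokens)  # ['show']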

However, the first line doesn't give the same tokenization as nltk.word_tokenize. If you want the latter's output, you can replace the first line with:

s = df.Text.apply(lambda row: word_tokenize(row)).explode().value_counts()

which gives the following from your sample data:

Have               1
you                1
what               1
a                  1
Describe           1
've                1
non-programming    1
tried              1
some               1
code               1
?                  1
links              1
Show               1
helpful            1
More               1
question           1
Name: Text, dtype: int64

Upvotes: 2

You can use the Counter collection to do what you need and then create a secondary dictionary containing only the words that pass your limit. Check the code below as an example with a limit of 2:

from collections import Counter

test_list = ["test", "test", "word", "hello"]

# Count each word, then keep only the words at or above the limit
counter = Counter(test_list)
filtered_counter = {k: v for k, v in counter.items() if v >= 2}
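
To apply the same idea to the whole Text column rather than a plain list, one option is to flatten the per-row token lists before counting. This is only a sketch: it assumes NLTK's word_tokenize (with the punkt data downloaded) and reuses the threshold of 15 from the question.

from collections import Counter
from itertools import chain

import pandas as pd
from nltk.tokenize import word_tokenize

# Toy stand-in for df['Text']
df = pd.DataFrame({"Text": ["Show some code", "Describe what you've tried"]})

# Tokenize each row, flatten into one stream of tokens, then count globally
counter = Counter(chain.from_iterable(df.Text.apply(word_tokenize)))

# Keep only tokens whose count reaches the threshold
N = 15
frequent_tokens = [token for token, count in counter.items() if count >= N]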

Upvotes: 0
