user12907213


Counting tokens in a document

I need to calculate the frequency of every token in the training data and make a list of the tokens whose frequency is at least N. To split my dataset into train and test sets I did the following:

import numpy as np
from sklearn.model_selection import train_test_split

# vectorizer is assumed to be a scikit-learn text vectorizer, e.g. CountVectorizer
X = vectorizer.fit_transform(df['Text'].replace(np.nan, ""))
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, stratify=y)

If the Text column contains sentences, for example

Text
Show some code
Describe what you've tried
Have a non-programming question?
More helpful links 

then to extract all the tokens I did the following:

import pandas as pd
from nltk.tokenize import word_tokenize

X_train['tokenized_text'] = X_train.Text.apply(lambda row: word_tokenize(row))

This gives me tokens per row, not globally. I need the full list of tokens, counted across all the rows, in order to make a list of the tokens whose frequency is at least N. My difficulty is in counting the frequency of tokens across the whole column.

Could you please tell me how to count these tokens?

UPDATE:

The following code works fine:

df.Text.str.split(expand=True).stack().value_counts()

However, I don't know how to extract all the words/tokens having a count greater than 15, for example.

Upvotes: 5

Views: 1124

Answers (2)

Quang Hoang

Reputation: 150785

Since you say the following works fine:

s = df.Text.str.split(expand=True).stack().value_counts()

then you can do

s[s >= 15].index

to get the tokens with at least 15 counts.
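
For reference, here is a minimal, self-contained sketch of that approach; the toy DataFrame and the threshold N are just placeholders standing in for your data:

import pandas as pd

# Toy stand-in for df['Text']
df = pd.DataFrame({"Text": ["show some code", "show what you tried", "show more links"]})

# Count every whitespace-separated token across the whole column
s = df.Text.str.split(expand=True).stack().value_counts()

# Keep only the tokens appearing at least N times
N = 2
frequent_tokens = s[s >= N].index.tolist()
print(frequent_tokens)  # ['show']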

However, the first line doesn't give the same tokenization as nltk.word_tokenize. If you want the latter's output, you can replace the first line with:

s = df.Text.apply(lambda row: word_tokenize(row)).explode().value_counts()

which gives the following from your sample data:

Have               1
you                1
what               1
a                  1
Describe           1
've                1
non-programming    1
tried              1
some               1
code               1
?                  1
links              1
Show               1
helpful            1
More               1
question           1
Name: Text, dtype: int64

Upvotes: 2

You can use the Counter collection to do what you need and then create a secondary dictionary containing only the words that pass your limit. Check the code below as an example with a limit of 2:

from collections import Counter

test_list = ["test", "test", "word", "hello"]

# Count each word, then keep only the words at or above the limit
counter = Counter(test_list)
filtered_counter = {k: v for k, v in counter.items() if v >= 2}
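
To apply the same idea to the whole Text column rather than a plain list, one option is to flatten the per-row token lists before counting. This is only a sketch: it assumes NLTK's word_tokenize (with the punkt data downloaded) and reuses the threshold of 15 from the question.

from collections import Counter
from itertools import chain

import pandas as pd
from nltk.tokenize import word_tokenize

# Toy stand-in for df['Text']
df = pd.DataFrame({"Text": ["Show some code", "Describe what you've tried"]})

# Tokenize each row, flatten into one stream of tokens, then count globally
counter = Counter(chain.from_iterable(df.Text.apply(word_tokenize)))

# Keep only tokens whose count reaches the threshold
N = 15
frequent_tokens = [token for token, count in counter.items() if count >= N]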

Upvotes: 0
