Reputation:
I need to calculate the frequency of every token in the training data and make a list of the tokens whose frequency is at least N. To split my dataset into train and test sets, I did the following:
import numpy as np
from sklearn.model_selection import train_test_split
# vectorizer is assumed to be defined earlier (e.g. a CountVectorizer)
X = vectorizer.fit_transform(df['Text'].replace(np.nan, ""))
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, stratify=y)
If the Text column contains sentences, for example:
Text
Show some code
Describe what you've tried
Have a non-programming question?
More helpful links
then, to extract all the tokens, I did the following:
import pandas as pd
from nltk.tokenize import word_tokenize
X_train['tokenized_text'] = X_train.Text.apply(lambda row: word_tokenize(row))
This gives me the tokens locally (per row), not globally. I need the full list, counted across all the rows, in order to make a list of the tokens whose frequency is at least N.
My difficulty is in counting the frequency of tokens across the whole column.
Could you please tell me how to count these tokens?
UPDATE:
The following code works fine:
df.Text.str.split(expand=True).stack().value_counts()
however, I don't know how to extract all the words/tokens with a count greater than 15, for example.
Upvotes: 5
Views: 1124
Reputation: 150785
Assuming, as you say, that the following works fine:
s = df.Text.str.split(expand=True).stack().value_counts()
Then you can do
s[s>=15].index
to get the tokens with at least 15 counts.
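For instance, to turn that into a plain Python list with a configurable threshold (a minimal sketch; N and frequent_tokens are placeholder names):
N = 15
frequent_tokens = s[s >= N].index.tolist()  # tokens occurring at least N times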
However, the first line doesn't give the same tokenization as nltk.word_tokenize. If you want the latter's output, you can replace the first line with:
s = df.Text.apply(lambda row: word_tokenize(row)).explode().value_counts()
which gives the following from your sample data:
Have 1
you 1
what 1
a 1
Describe 1
've 1
non-programming 1
tried 1
some 1
code 1
? 1
links 1
Show 1
helpful 1
More 1
question 1
Name: Text, dtype: int64
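One caveat: word_tokenize relies on NLTK's punkt tokenizer models, so if they are not already installed, a one-time download is needed first:
import nltk
nltk.download('punkt')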
Upvotes: 2
Reputation: 141
You can use the Counter collection to do what you need and then build a secondary dict containing only the words that pass your limit. Check the code below as an example with limit 2:
from collections import Counter

test_list = ["test", "test", "word", "hello"]
counter = Counter(test_list)  # Counter({'test': 2, 'word': 1, 'hello': 1})
# keep only the tokens whose count meets the limit
filtered_counter = {k: v for k, v in counter.items() if v >= 2}  # {'test': 2}
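To connect this back to the question, the same pattern can be applied across the whole Text column. A minimal sketch, assuming df and word_tokenize as in the question (frequent_tokens is a placeholder name):
from collections import Counter
from itertools import chain
from nltk.tokenize import word_tokenize

# flatten the per-row token lists into one stream and count globally
counts = Counter(chain.from_iterable(df.Text.apply(word_tokenize)))
frequent_tokens = [token for token, count in counts.items() if count >= 15]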
Upvotes: 0