Reputation: 30
I would like to print out the list of words (i.e., the bag of words) for each document in a corpus, along with their respective term frequencies (in text format), using Sklearn's CountVectorizer. How could I achieve that?
Here is my code:
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate vectorizer
vectorizer = CountVectorizer()

# Document creation
document1 = 'this is a sunny day'
document2 = 'today is a very very very pleasant day and we have fun fun fun'
document3 = 'this is an amazin experience'

# List of documents
list_of_words = [document1, document2, document3]

# Learn the vocabulary (fit returns the fitted vectorizer itself)
bag_of_words = vectorizer.fit(list_of_words)

# Verify the vocabulary indices of repeated words
print(vectorizer.vocabulary_.get('very'))
print(vectorizer.vocabulary_.get('fun'))

# Transform the documents into a sparse document-term matrix
bag_of_words = vectorizer.transform(list_of_words)
print(bag_of_words)

which prints:

  (0, 3)    1
  (0, 7)    1
  (0, 9)    1
  (0, 10)   1
  (1, 2)    1
  (1, 3)    1
  (1, 5)    3
  (1, 6)    1
  (1, 7)    1
  (1, 8)    1
  (1, 11)   1
  (1, 12)   3
  (1, 13)   1
  (2, 0)    1
  (2, 1)    1
  (2, 4)    1
  (2, 7)    1
  (2, 10)   1
Upvotes: 1
Views: 926
Reputation: 34560
You could use the get_feature_names_out() and toarray() methods to get the list of words and the frequency of each term, respectively (on scikit-learn versions older than 1.0, the method is called get_feature_names()). Using a Pandas DataFrame, you can print both to the console or export them to a .csv file. The stopwords list provided by nltk can optionally be used to remove any stopwords from the documents (to extend the current list with more stop words, please have a look at this answer).
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import nltk
from nltk.corpus import stopwords
# You need to run this only once, in order to download the stopwords list
nltk.download('stopwords')
# Load the stopwords list
stop_words_list = stopwords.words('english')
# The documents
document1='Hope you have a pleasant day. Have fun.'
document2= 'Today is a very pleasant day and we will have fun fun fun'
document3= 'This event has been amazing. We had a lot of fun the whole day'
# List of documents
list_of_documents= [document1, document2, document3]
# Instantiate CountVectorizer
cv = CountVectorizer(stop_words=stop_words_list)
# Fit and transform
cv_fit = cv.fit_transform(list_of_documents)
word_list = cv.get_feature_names_out()  # use cv.get_feature_names() on scikit-learn < 1.0
count_list = cv_fit.toarray()
# Create a dataframe with words and their respective frequency
# Each row represents a document starting from document1
df = pd.DataFrame(data=count_list, columns=word_list)
# Print out the df
print(df)
# Optionally, save the df to a csv file
df.to_csv("bag_of_words.csv")
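If you specifically want the per-document output as plain text (word: count pairs, which is what the question asks for), here is a minimal sketch building on the df above; it skips zero counts, and the variable names are just illustrative:

# Print each document's bag of words and term frequencies as text
for i, row in df.iterrows():
    present = row[row > 0]  # keep only the words that occur in this document
    pairs = ", ".join(f"{word}: {count}" for word, count in present.items())
    print(f"Document {i + 1}: {pairs}")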
To output the term frequency of the entire corpus (i.e., to summarize the results across all documents), you could use the following, in addition to the example above:
import numpy as np

# Sum the counts over all documents (axis=0) and pair them with the vocabulary
d = dict(zip(word_list, np.asarray(cv_fit.sum(axis=0))[0]))

# Sort the words by frequency in descending order
sorted_d = dict(sorted(d.items(), key=lambda item: item[1], reverse=True))
print(sorted_d)
# Optionally, create a DataFrame
df = pd.DataFrame.from_dict(data=sorted_d, orient='index')
print(df)
df.to_csv("total_freq.csv")
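With the three example documents above, the sorted dictionary should look something like this (ties keep alphabetical order, since CountVectorizer sorts its vocabulary):

{'fun': 5, 'day': 3, 'pleasant': 2, 'amazing': 1, 'event': 1, 'hope': 1, 'lot': 1, 'today': 1, 'whole': 1}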
Upvotes: 0