FAISAL BARGI

Reputation: 30

How to get bag of words and term frequency in text format using Sklearn?

I would like to print out the list of words (i.e., the bag of words) for each document in a corpus, together with each word's term frequency, in plain text format, using Sklearn's CountVectorizer. How can I achieve that?

Here is my code:

from sklearn.feature_extraction.text import CountVectorizer

# Instantiate the vectorizer
vectorizer = CountVectorizer()

# Create the documents
document1 = 'this is a sunny day'
document2 = 'today is a very very very pleasant day and we have fun fun fun'
document3 = 'this is an amazin experience'

# List of documents
list_of_words = [document1, document2, document3]

# Learn the vocabulary
vectorizer.fit(list_of_words)

# Verify the vocabulary entries of repeated words
# (note: vocabulary_ maps each term to its column index, not its count)
print(vectorizer.vocabulary_.get('very'))
print(vectorizer.vocabulary_.get('fun'))

# Transform the documents into a (sparse) document-term matrix
bag_of_words = vectorizer.transform(list_of_words)

print(bag_of_words)

Output:

  (0, 3)	1
  (0, 7)	1
  (0, 9)	1
  (0, 10)	1
  (1, 2)	1
  (1, 3)	1
  (1, 5)	3
  (1, 6)	1
  (1, 7)	1
  (1, 8)	1
  (1, 11)	1
  (1, 12)	3
  (1, 13)	1
  (2, 0)	1
  (2, 1)	1
  (2, 4)	1
  (2, 7)	1
  (2, 10)	1

That is, the counts are printed in sparse-matrix coordinate format, where each line is a (document_index, vocabulary_index) pair followed by the count, rather than the actual words.

Upvotes: 1

Views: 926

Answers (1)

Chris

Reputation: 34560

You could use the get_feature_names() and toarray() methods to get the list of words and the frequency of each term, respectively. With a Pandas DataFrame, you can then print both to the console or export them to a .csv file. The stopwords list provided by nltk can optionally be used to remove any stopwords from the documents (to extend that list with more stop words, have a look at this answer).

Example

from sklearn.feature_extraction.text import CountVectorizer  
import pandas as pd
import nltk
from nltk.corpus import stopwords

# You need to run this only once, in order to download the stopwords list
nltk.download('stopwords') 

# Load the stopwords list
stop_words_list = stopwords.words('english')

# The documents
document1 = 'Hope you have a pleasant day. Have fun.'
document2 = 'Today is a very pleasant day and we will have fun fun fun'
document3 = 'This event has been amazing. We had a lot of fun the whole day'

# List of documents
list_of_documents = [document1, document2, document3]

# Instantiate CountVectorizer
cv = CountVectorizer(stop_words=stop_words_list)

# Fit and transform
cv_fit = cv.fit_transform(list_of_documents)
word_list = cv.get_feature_names()
count_list = cv_fit.toarray()

# Create a dataframe with words and their respective frequency 
# Each row represents a document starting from document1
df = pd.DataFrame(data=count_list, columns=word_list)

# Print out the df
print(df)

# Optionally, save the df to a csv file
df.to_csv("bag_of_words.csv") 

Output:

   amazing  day  event  fun  hope  lot  pleasant  today  whole
0        0    1      0    1     1    0         1      0      0
1        0    1      0    3     0    0         1      1      0
2        1    1      1    1     0    1         0      0      1
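Note that in newer scikit-learn releases (1.2 and later), get_feature_names() has been removed; if you are on such a version, get_feature_names_out() is the drop-in replacement:

# For scikit-learn >= 1.2, where get_feature_names() was removed
word_list = cv.get_feature_names_out()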

To output the term frequency of the entire corpus (i.e., to summarize the results from all documents), you could use the code below (in addition to the example above):

import numpy as np

# Sum the counts over all documents and map each word to its total frequency
d = dict(zip(word_list, np.asarray(cv_fit.sum(axis=0))[0]))

# Sort the words by total frequency, in descending order
sorted_d = dict(sorted(d.items(), key=lambda item: item[1], reverse=True))
print(sorted_d)

# Optionally, create a DataFrame
df = pd.DataFrame.from_dict(data=sorted_d, orient='index')
print(df)
df.to_csv("total_freq.csv")

Output:

{'fun': 5, 'day': 3, 'pleasant': 2, 'amazing': 1, 'event': 1, 'hope': 1, 'lot': 1, 'today': 1, 'whole': 1}

          0
fun       5
day       3
pleasant  2
amazing   1
event     1
hope      1
lot       1
today     1
whole     1
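Alternatively, the same totals can be computed straight from the per-document DataFrame built in the first example (before df is reassigned above); a minimal sketch:

# Sum each word's column over all documents and sort in descending order
# (assumes df is the per-document DataFrame from the first example)
total_freq = df.sum(axis=0).sort_values(ascending=False)
print(total_freq)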

Upvotes: 0
