Reputation: 231
I have a set of words that I need to check for in a collection of documents:
WordList = [w1, w2, ..., wn]
I also have a list of documents in which I need to check whether these words are present or not.
How can I use scikit-learn's CountVectorizer so that the features of the term-document matrix are only the words from WordList, and each row represents a particular document, with the number of times each word from the given list appears in its respective column?
Upvotes: 2
Views: 10289
Reputation: 61
For custom documents, you can use the CountVectorizer approach:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()  # make a CountVectorizer object
corpus = [
    'This is a cat.',
    'It likes to roam in the garden',
    'It is black in color',
    'The cat does not like the dog.',
]
X = vectorizer.fit_transform(corpus)
# print(X) to see the count given to each word

# In scikit-learn >= 1.0, use get_feature_names_out() instead.
vectorizer.get_feature_names() == (
    ['black', 'cat', 'color', 'does', 'dog', 'garden',
     'in', 'is', 'it', 'like', 'likes', 'not',
     'roam', 'the', 'this', 'to'])

X.toarray()  # used to convert X into a numpy array

vectorizer.transform(['A new cat.']).toarray()  # checking it for a new document
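To see which column corresponds to which word, the feature names can be paired with a row of counts. A small illustrative addition to the snippet above:

counts = X.toarray()
for word, count in zip(vectorizer.get_feature_names(), counts[0]):
    print(word, count)  # e.g. 'cat 1' for the first document, 0 for absent words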
Other vectorizers can also be used, such as TfidfVectorizer. TF-IDF is often a better approach, as it not only gives the number of occurrences of a word in a particular document but also reflects the importance of that word.
It is calculated from TF (term frequency) and IDF (inverse document frequency).
Term frequency is the number of times a word appears in a particular document, and IDF is calculated from how many documents contain the word: idf(w) = log(N / df(w)), where N is the total number of documents and df(w) is the number of documents containing w. A word that appears in every document gives no insight, while a rarer word does. For example, if the documents are related to football, the word "the" would not give any insight, but the word "messi" would tell about the context of the document. Using base-10 logs, suppose there are N = 10 documents, "the" appears in all 10 and "messi" in 3, and within one document tf("the") = 10 and tf("messi") = 5:
idf("the") = log(10/10) = 0
idf("messi") = log(10/3) ≈ 0.52
tfidf("the") = tf("the") * idf("the") = 10 * 0 = 0
tfidf("messi") = 5 * 0.52 = 2.6
These weights help the algorithm to identify the important words out of the documents that later helps to derive semantics out of the doc.
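As a minimal sketch, TfidfVectorizer can be swapped in for CountVectorizer on the same corpus (the variable names are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is a cat.',
    'It likes to roam in the garden',
    'It is black in color',
    'The cat does not like the dog.',
]
tfidf_vectorizer = TfidfVectorizer()
weights = tfidf_vectorizer.fit_transform(corpus)
print(weights.toarray())  # each row holds tf-idf weights instead of raw counts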
Upvotes: 3
Reputation: 231
Ok. I get it. The code is given below:
from sklearn.feature_extraction.text import CountVectorizer

# Count the number of times each word (unigram) appears in a document.
vectorizer = CountVectorizer(input='content', binary=False, ngram_range=(1, 1))
# First, set the vocabulary by fitting on the word list.
vectorizer = vectorizer.fit(WordList)
# Now transform the text contained in each document (a list of strings).
tfMatrix = vectorizer.transform(Document_List).toarray()
This outputs a term-document matrix whose features come only from WordList.
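An equivalent and more direct route (a sketch; WordList and Document_List here are hypothetical stand-ins for your own data) is to pass the word list through CountVectorizer's vocabulary parameter, which pins the features to exactly those words without a separate fit step:

from sklearn.feature_extraction.text import CountVectorizer

WordList = ['cat', 'dog', 'garden']
Document_List = ['The cat does not like the dog.',
                 'It likes to roam in the garden']

vectorizer = CountVectorizer(vocabulary=WordList)
tfMatrix = vectorizer.fit_transform(Document_List).toarray()
print(tfMatrix)
# [[1 1 0]
#  [0 0 1]]

With a fixed vocabulary, the columns follow the order of WordList, and words outside the list are simply ignored.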
Upvotes: 2