Reputation: 121

How does the vectorizer's fit_transform work in sklearn?

I'm trying to understand the following code:

from sklearn.feature_extraction.text import CountVectorizer 

vectorizer = CountVectorizer() 

corpus = ['This is the first document.','This is the second second document.','And the third one.','Is this the first document?'] 

X = vectorizer.fit_transform(corpus)

When I print X to see what is returned, I get this result:

(0, 1)  1

(0, 2)  1

(0, 6)  1

(0, 3)  1

(0, 8)  1

(1, 5)  2

(1, 1)  1

(1, 6)  1

(1, 3)  1

(1, 8)  1

(2, 4)  1

(2, 7)  1

(2, 0)  1

(2, 6)  1

(3, 1)  1

(3, 2)  1

(3, 6)  1

(3, 3)  1

(3, 8)  1

However, I don't understand what this result means.

Upvotes: 11

Answers (3)

PinkBanter

Reputation: 1976

As @Himanshu writes, this is a "(sentence_index, feature_index) count"

Here, the count part is the "number of times a word appears in a document"

For example,

(0, 1) 1

(0, 2) 1

(0, 6) 1

(0, 3) 1

(0, 8) 1

(1, 5) 2   <- in this example, the count "2" means that the word "second" (feature index 5) appears twice in this document/sentence

(1, 1) 1

(1, 6) 1

(1, 3) 1

(1, 8) 1

(2, 4) 1

(2, 7) 1

(2, 0) 1

(2, 6) 1

(3, 1) 1

(3, 2) 1

(3, 6) 1

(3, 3) 1

(3, 8) 1

Let's change the corpus in your code. I added the word "second" two more times to the second sentence of the corpus list, so it now appears four times.

from sklearn.feature_extraction.text import CountVectorizer 

vectorizer = CountVectorizer() 

corpus = ['This is the first document.','This is the second second second second document.','And the third one.','Is this the first document?'] 

X = vectorizer.fit_transform(corpus)

(0, 1) 1

(0, 2) 1

(0, 6) 1

(0, 3) 1

(0, 8) 1

(1, 5) 4   <- for the modified corpus, the count "4" means that the word "second" appears four times in this document/sentence

(1, 1) 1

(1, 6) 1

(1, 3) 1

(1, 8) 1

(2, 4) 1

(2, 7) 1

(2, 0) 1

(2, 6) 1

(3, 1) 1

(3, 2) 1

(3, 6) 1

(3, 3) 1

(3, 8) 1

Upvotes: 4

Himanshu Kriplani

Reputation: 121

You can interpret this as "(sentence_index, feature_index) count"

As there are 4 sentences, the sentence index starts at 0 and ends at 3.

The feature index is the word index, which you can get from vectorizer.vocabulary_

-> vocabulary_ is a dictionary {word: feature_index, ...}

So for the example (0, 1) 1:

-> 0 : row[the sentence index]

-> 1 : the feature index (i.e. the word); since vocabulary_ maps word -> index, find the word whose value in vectorizer.vocabulary_ is 1 (here, "document")

-> 1 : count/tfidf (as you have used a count vectorizer, it will give you count)

If you use a tfidf vectorizer (TfidfVectorizer) instead of a count vectorizer, it will give you tfidf values. I hope I made it clear.
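As a rough illustration of the lookup described above, the vocabulary_ dictionary can be inverted to go from a feature index to its word. This is a sketch, not the only way to do it; index_to_word, dense and tfidf are names made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps word -> feature index, so invert it to look up by index.
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}
print(index_to_word[1])  # the word behind feature index 1

# The dense form has one row per sentence and one column per word.
dense = X.toarray()
print(dense)

# Same corpus with TfidfVectorizer: same shape, but weighted values instead of counts.
tfidf = TfidfVectorizer().fit_transform(corpus)
print(tfidf.shape)
```

Reading the dense array row by row gives the same information as the (sentence_index, feature_index) count listing, just with the zeros written out.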

Upvotes: 5

Kaan

Reputation: 1

It transforms text into numbers. With other functions you can then count how many times each word occurs in the given data set. I'm new to programming, so there may be other uses as well.
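To make "transforms text to numbers" concrete, here is a pure-Python approximation of what fit_transform does with default settings. It is a simplified sketch and not scikit-learn's actual implementation: it assumes lowercasing, tokens of two or more word characters, and feature indices assigned in alphabetical order. The names tokenize, vocabulary and entries are hypothetical.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Assumption: approximates the default tokenization (lowercase, 2+ word characters).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


# "fit": collect every distinct word and assign indices in sorted order.
all_words = sorted({word for doc in corpus for word in tokenize(doc)})
vocabulary = {word: idx for idx, word in enumerate(all_words)}

# "transform": count each word per sentence, keyed by (sentence_index, feature_index).
entries = {}
for row, doc in enumerate(corpus):
    for word, count in Counter(tokenize(doc)).items():
        entries[(row, vocabulary[word])] = count

for (row, col), count in sorted(entries.items()):
    print((row, col), count)
```

The printed pairs carry the same indices and counts as the output in the question, although in a different order.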

Upvotes: -2
