Reputation: 11331
I'm trying to get words that are distinctive of certain documents using the TfidfVectorizer class in scikit-learn. It creates a tf-idf matrix with all the words and their scores across all the documents, but it seems to score common words highly, too. This is some of the code I'm running:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# 'contents' is a list of document strings; 'characters' holds their names
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(contents)
feature_names = vectorizer.get_feature_names()
dense = tfidf_matrix.todense()
denselist = dense.tolist()
df = pd.DataFrame(denselist, columns=feature_names, index=characters)
s = pd.Series(df.loc['Adam'])
s[s > 0].sort_values(ascending=False)[:10]
I expected this to return a list of distinctive words for the 'Adam' document, but instead it returns a list of common words:
and 0.497077
to 0.387147
the 0.316648
of 0.298724
in 0.186404
with 0.144583
his 0.140998
I might not understand it perfectly, but as I understand it, tf-idf is supposed to find words that are distinctive of one document in a corpus: words that appear frequently in one document but not in the others. Here, 'and' appears frequently in other documents too, so I don't know why it's getting a high value here.
The complete code I'm using to generate this is in this Jupyter notebook.
When I compute tf-idf scores semi-manually, using NLTK and computing scores for each word, I get the appropriate results. For the 'Adam' document:
fresh 0.000813
prime 0.000813
bone 0.000677
relate 0.000677
blame 0.000677
enough 0.000677
That looks about right, since these are words that appear in the 'Adam' document, but not as much in other documents in the corpus. The complete code used to generate this is in this Jupyter notebook.
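For reference, here is a minimal sketch of that semi-manual computation (the helper function and the docs dictionary are hypothetical, not the notebook's actual code). With the classic idf = log(N / df), any word that appears in every document gets an idf of exactly zero, which is why words like 'and' drop out:

import math
from collections import Counter

def tf_idf_scores(docs):
    """Classic tf-idf by hand; docs maps document name -> list of tokens."""
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        # tf = count / doc length; idf = log(N / df), which is 0 for
        # terms that occur in every document
        scores[name] = {term: (c / total) * math.log(n_docs / df[term])
                        for term, c in counts.items()}
    return scores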
Am I doing something wrong with the scikit-learn code? Is there another way to initialize this class so that it returns the right results? Of course, I can ignore stopwords by passing stop_words='english', but that doesn't really solve the problem, since common words of any sort shouldn't have high scores here.
Upvotes: 33
Views: 68793
Reputation: 157
The answer to your question may lie in the size of your corpus and in the source code of the different implementations. I haven't looked into the NLTK code in detail, but the 3-8 documents in your scikit-learn code are probably not enough to constitute a corpus. When corpora are constructed, news archives with hundreds of thousands of articles or thousands of books are used. Maybe the frequency of words like 'the' across your 8 documents was not large enough overall to account for the commonness of these words among those documents.
If you look at the source code, you might be able to find differences in implementation, such as whether they follow different normalization steps or use different frequency distributions (https://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html lists common tf-idf variants).
Another thing that may help is looking at the term frequencies (CountVectorizer in scikit-learn) and checking whether words like 'the' really are over-represented in all documents.
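A minimal sketch of that check, reusing the 'contents' list from the question and the same get_feature_names() API the question uses:

from sklearn.feature_extraction.text import CountVectorizer

count_vec = CountVectorizer()
counts = count_vec.fit_transform(contents)
totals = counts.sum(axis=0).A1  # total count of each term across the corpus
terms = count_vec.get_feature_names()
# print the ten most frequent terms overall
for term, total in sorted(zip(terms, totals), key=lambda t: -t[1])[:10]:
    print(term, int(total))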
Upvotes: 0
Reputation: 409
Using the code below, I get much better results:
vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english')
Output
sustain 0.045090
bone 0.045090
thou 0.044417
thee 0.043673
timely 0.043269
thy 0.042731
prime 0.041628
absence 0.041234
rib 0.041234
feel 0.040259
Name: Adam, dtype: float64
and
thee 0.071188
thy 0.070549
forbids 0.069358
thou 0.068068
early 0.064642
earliest 0.062229
dreamed 0.062229
firmness 0.062229
glistering 0.062229
sweet 0.060770
Name: Eve, dtype: float64
Upvotes: 6
Reputation: 866
I believe your issue lies in using different stopword lists. Scikit-learn and NLTK use different stopword lists by default. For scikit-learn it is usually a good idea to have a custom stop_words list passed to TfidfVectorizer, e.g.:
from sklearn.feature_extraction.text import TfidfVectorizer

my_stopword_list = ['and', 'to', 'the', 'of']
my_vectorizer = TfidfVectorizer(stop_words=my_stopword_list)
Doc page for the TfidfVectorizer class: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
Upvotes: 7
Reputation: 14857
I'm not sure why it's not the default, but you probably want sublinear_tf=True in the initialization for TfidfVectorizer. Sublinear tf scaling replaces the raw term frequency tf with 1 + log(tf), which damps the influence of words that occur very many times. I forked your repo and sent you a PR with an example that probably looks more like what you want.
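For a quick sense of the scaling (an illustrative snippet, not from the PR):

import math

# sublinear tf scaling: a raw count tf becomes 1 + log(tf), so a word
# occurring 1000 times weighs ~7.9 rather than 1000x a single occurrence
for tf in (1, 10, 100, 1000):
    print(tf, 1 + math.log(tf))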
Upvotes: 3
Reputation: 777
From scikit-learn documentation:
As tf–idf is very often used for text features, there is also another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model.
As you can see, TfidfVectorizer is a CountVectorizer followed by TfidfTransformer.
What you are probably looking for is TfidfTransformer, not TfidfVectorizer.
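A minimal sketch of the two-step pipeline the quoted docs describe, assuming the same 'contents' list as in the question:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# step 1: raw term counts; step 2: reweight the counts into tf-idf scores
counts = CountVectorizer().fit_transform(contents)
tfidf = TfidfTransformer().fit_transform(counts)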
Upvotes: 8