filpa

Reputation: 3644

How do I use the tf-idf-calculating functions in scikit-learn?

I want to use TfidfVectorizer and the associated functions from scikit-learn to perform document classification, but I am a little puzzled about how to use it (and none of the other questions I've searched dealt with proper data formatting).

Currently, my training data is organized in the following way (a rough sketch of these steps follows the list):

  1. Get a single text from the corpus.
  2. Normalize, tokenize (using nltk's PunktWordTokenizer), and stem (using nltk's SnowballStemmer).
  3. Filter out the remaining words that are too short or occur too rarely.
  4. Label the corresponding text.
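
Roughly, steps 2-4 look like this sketch (the helper name preprocess and the length/count thresholds are just illustrative, and I use nltk's word_tokenize here for brevity rather than PunktWordTokenizer):

from collections import Counter
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

stemmer = SnowballStemmer("english")

def preprocess(raw_text, label, min_length=3, min_count=2):
    # normalize and tokenize, then stem each token
    tokens = [stemmer.stem(t) for t in word_tokenize(raw_text.lower())]
    # count occurrences, dropping very short tokens
    counts = Counter(t for t in tokens if len(t) >= min_length)
    # keep only terms that occur often enough, storing counts as floats
    counts = {t: float(c) for t, c in counts.items() if c >= min_count}
    return (counts, label)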

After the above is done, a single text looks as follows (the values here are made up, but each corresponds to the number of occurrences of a term):

text = ({"has": 5.0, "love": 12.0, ...}, "Relationships")

Whereas the full corpus looks something like this at the end:

corpus = [({"has": 5.0, "love": 12.0, ...}, "Relationships"),
          ({"game": 9, "play": 9.0, ...}, "Games"),
          ...,
         ]

How would I feed this data into TfidfVectorizer()? Do I supply the data as it is above (as dictionaries? as lists?), or just the content without the counts? When do I supply the labels? I wouldn't mind refactoring my data completely if need be.

Unfortunately, the documentation for this particular class is a little sparse on examples of how the input should be formatted.

Upvotes: 2

Views: 2786

Answers (1)

elyase

Reputation: 40973

This is how you would use TfidfVectorizer (see the scikit-learn documentation for more details):

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = ['This is the first document.',
...           'This is the second second document.',
...           'And the third one.',
...           'Is this the first document?']
>>> vect = TfidfVectorizer()
>>> X = vect.fit_transform(corpus)
>>> X.todense()

matrix([[ 0.        ,  0.43877674,  0.54197657,  0.43877674,  0.        ,
          0.        ,  0.35872874,  0.        ,  0.43877674],
        [ 0.        ,  0.27230147,  0.        ,  0.27230147,  0.        ,
          0.85322574,  0.22262429,  0.        ,  0.27230147],
        [ 0.55280532,  0.        ,  0.        ,  0.        ,  0.55280532,
          0.        ,  0.28847675,  0.55280532,  0.        ],
        [ 0.        ,  0.43877674,  0.54197657,  0.43877674,  0.        ,
          0.        ,  0.35872874,  0.        ,  0.43877674]])
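
Each column of that matrix corresponds to one term in the learned vocabulary; if you want to see the mapping, the vectorizer exposes it (get_feature_names_out in recent scikit-learn versions, get_feature_names in older ones):

>>> list(vect.get_feature_names_out())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']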

That matrix is the numeric representation of your text corpus. To fit a model that maps documents to your labels, start by putting the labels into a target variable; its length must match the number of documents in the corpus:

>>> y = ['Relationships', 'Games', ...]
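
If your labels are still attached to the documents, as in the list of (counts, label) tuples from your question, you can pull them out in one pass (your_corpus below is a placeholder for that list, to avoid clashing with the corpus of strings above):

>>> y = [label for counts, label in your_corpus]  # your_corpus = the (counts, label) tuples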

Now you can fit any model, for example:

>>> from sklearn.linear_model import SGDClassifier
>>> model = SGDClassifier()
>>> model.fit(X, y)

Now you have a fitted model that you can evaluate on new data. To predict, repeat the same process for the new corpus or texts; note that I am using the same vectorizer vect as before:

X_pred = vect.transform(['My new document'])
y_pred = model.predict(X_pred)
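
If you would rather not keep track of the vectorizer and the classifier separately, you can bundle the two steps in a Pipeline; this is just an alternative way of writing the same workflow (the step names 'tfidf' and 'clf' are arbitrary):

>>> from sklearn.pipeline import Pipeline
>>> pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', SGDClassifier())])
>>> pipe.fit(corpus, y)
>>> y_pred = pipe.predict(['My new document'])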

If you want to use a custom tokenizer, use:

vect = TfidfVectorizer(tokenizer=your_custom_tokenizer_function)
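
For example, a tokenizer that also applies the Snowball stemming from your preprocessing could look like this (stem_tokenize is just an illustrative name):

from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

stemmer = SnowballStemmer("english")

def stem_tokenize(text):
    # tokenize the raw text, then stem each token
    return [stemmer.stem(token) for token in word_tokenize(text)]

vect = TfidfVectorizer(tokenizer=stem_tokenize)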

Upvotes: 3
