I want to use TfidfVectorizer and associated functions from scikit-learn to perform document classification, but I am a little puzzled about its use (and none of the other questions I've searched dealt with proper data formatting).
Currently, my training data is organized as a list of (term counts, label) tuples. A single text looks as follows (the values here are random but correspond to the count / number of occurrences of each term):
text = ({"has": 5.0, "love": 12.0, ...}, "Relationships")
Whereas the full corpus looks something like this at the end:
corpus = [({"has": 5.0, "love": 12.0, ...}, "Relationships"),
          ({"game": 9.0, "play": 9.0, ...}, "Games"),
          ...,
         ]
How would I feed this data into TfidfVectorizer()? Do I have to supply the content as it is above (as dictionaries? as lists?), or just the content without the counts? When do I supply the labels? I wouldn't mind refactoring my data completely if need be.
Unfortunately, the documentation for this specific class is a little sparse on examples of input formatting.
This is how you would use TfidfVectorizer (see the scikit-learn documentation for more details):
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = ['This is the first document.',
...           'This is the second second document.',
...           'And the third one.',
...           'Is this the first document?']
>>> vect = TfidfVectorizer()
>>> X = vect.fit_transform(corpus)
>>> X.todense()
matrix([[ 0. , 0.43877674, 0.54197657, 0.43877674, 0. ,
0. , 0.35872874, 0. , 0.43877674],
[ 0. , 0.27230147, 0. , 0.27230147, 0. ,
0.85322574, 0.22262429, 0. , 0.27230147],
[ 0.55280532, 0. , 0. , 0. , 0.55280532,
0. , 0.28847675, 0.55280532, 0. ],
[ 0. , 0.43877674, 0.54197657, 0.43877674, 0. ,
0. , 0.35872874, 0. , 0.43877674]])
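Each column of this matrix corresponds to one term of the learned vocabulary (sorted alphabetically). If you want to see that mapping, the vectorizer exposes it; in recent scikit-learn versions the method is get_feature_names_out (older versions use get_feature_names):
>>> vect.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)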
This is the numeric representation of your text corpus. Now, to fit a model that maps documents to your labels, start by putting the labels in a target variable; its length should match the number of documents in the corpus:
>>> y = ['Relationships', 'Games', ...]
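If your data is still in the (counts, label) tuple format from your question, the labels can be pulled out with a list comprehension; a small sketch, where labeled_corpus stands for your original list of tuples:
>>> # hypothetical: `labeled_corpus` is the (count_dict, label) list from the question
>>> y = [label for counts, label in labeled_corpus]
Note that TfidfVectorizer does the tokenizing and counting itself, so you would feed it the original raw strings rather than your precomputed count dictionaries.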
Now you can fit any model, for example:
>>> from sklearn.linear_model import SGDClassifier
>>> model = SGDClassifier()
>>> model.fit(X, y)
Now you have a fitted model that you can evaluate on new data. To predict, repeat the same process for the new texts. Note that I am using the same vectorizer vect as before, and calling transform rather than fit_transform, so the vocabulary learned during training is reused:
>>> X_pred = vect.transform(['My new document'])
>>> y_pred = model.predict(X_pred)
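If you want a quick accuracy estimate, hold out part of the data before fitting; a minimal sketch using train_test_split on the X and y from above (strictly, the vectorizer should be fitted on the training texts only so that idf statistics do not leak, but this shows the mechanics):
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
>>> model = SGDClassifier()
>>> model.fit(X_train, y_train)
>>> model.score(X_test, y_test)  # mean accuracy on the held-out documents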
If you want to use a custom tokenizer, pass it in:
>>> vect = TfidfVectorizer(tokenizer=your_custom_tokenizer_function)
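Here your_custom_tokenizer_function is any callable that takes a document string and returns a list of tokens; a minimal hypothetical sketch:
>>> def your_custom_tokenizer_function(text):
...     # hypothetical tokenizer: lowercase the text, then split on whitespace
...     return text.lower().split()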