Reputation: 81
I'm new to scikit-learn and to working with text data in general. I've been using scikit-learn's CountVectorizer as a starting point to get used to basic text features (n-grams), but I want to extend this to extract other kinds of features.
I would prefer to adapt CountVectorizer rather than write my own vectorizer, because then I wouldn't have to reimplement scikit-learn's tf-idf transformer and classifiers.
EDIT:
To be honest, I'm still deciding on the specific features, but for my project I want to do style classification between documents. I know that lemmatizing and stemming are popular for feature extraction in text classification, so that may be one of them. Other features that I am thinking of analyzing include:
Those are a few ideas I had, but I'm still looking for more features to test!
Upvotes: 0
Views: 291
Reputation: 28758
Are you asking how to implement the features you listed as a scikit-learn compatible transformer? Then maybe have a look at the developer docs, in particular the section on rolling your own estimator.
You can just inherit from BaseEstimator and implement a fit and a transform. That is only necessary if you want to use pipelining, though. For using sklearn classifiers and the tf-idf transformer, it is only necessary that your feature extraction produces numpy arrays or scipy sparse matrices.
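A minimal sketch of what such a transformer could look like; the feature itself (average word length per document) is just a placeholder I made up for illustration, and adding TransformerMixin is optional but gives you fit_transform for free:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class AverageWordLength(BaseEstimator, TransformerMixin):
    """Toy feature extractor: one column with the mean word length per document."""

    def fit(self, raw_documents, y=None):
        # nothing to learn for this feature, so fit is a no-op
        return self

    def transform(self, raw_documents):
        # return a 2-D numpy array so sklearn classifiers can consume it
        features = []
        for doc in raw_documents:
            words = doc.split()
            features.append([np.mean([len(w) for w in words]) if words else 0.0])
        return np.array(features)
```

Because the output is a plain numpy array, it can be used directly in a Pipeline or combined with the CountVectorizer output via a FeatureUnion.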
Upvotes: 1
Reputation: 19862
You can easily extend the class (you can see its source here) and implement what you need. However, it depends on what you want to do, which is not very clear in your question.
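For example (a rough sketch, not from your post): one common way to hook into CountVectorizer without rewriting it is to subclass it and override build_analyzer, e.g. to lemmatize tokens. The WordNetLemmatizer here is just one possible choice and assumes NLTK and its wordnet data are installed:

```python
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer


class LemmaCountVectorizer(CountVectorizer):
    """CountVectorizer that lemmatizes tokens before counting them."""

    def build_analyzer(self):
        lemmatizer = WordNetLemmatizer()
        # reuse the stock analyzer (preprocessing, tokenization, n-grams) ...
        analyzer = super(LemmaCountVectorizer, self).build_analyzer()
        # ... and lemmatize each token it produces
        return lambda doc: [lemmatizer.lemmatize(token) for token in analyzer(doc)]
```

Since the subclass leaves the rest of CountVectorizer intact, it still works with the tf-idf transformer and any sklearn classifier downstream.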
Upvotes: 1