Nice-kun

Reputation: 81

Is it possible to adapt the scikit-learn CountVectorizer for other features (not just n-grams)?

I'm new to scikit-learn and to working with text data in general. I've been using the scikit-learn CountVectorizer as a starting point to get used to basic text features (n-grams), but I want to extend this to analyze other features.

I would prefer to adapt the CountVectorizer rather than write my own, because then I wouldn't have to reimplement scikit-learn's tf-idf transformer and classifier.

EDIT:

I'm actually still thinking about specific features, to be honest, but for my project I want to do style classification between documents. I know that for text classification, lemmatizing and stemming are popular for feature extraction, so that may be one. Other features that I am thinking of analyzing include

These are a few ideas I was thinking of, but I'm thinking of more features to test!

Upvotes: 0

Views: 291

Answers (2)

Andreas Mueller

Reputation: 28758

Are you asking how to implement the features you listed as a scikit-learn compatible transformer? Then have a look at the developer docs, in particular the section on rolling your own estimator.

You can just inherit from BaseEstimator and implement a fit and a transform method. That is only necessary if you want to use pipelining, though. To use the sklearn classifiers and the tf-idf transformer, it is enough that your feature extraction produces NumPy arrays or SciPy sparse matrices.
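To illustrate the fit/transform idea, here is a minimal sketch of such a transformer. The class name `StyleFeatures` and the two statistics it computes (average word length and punctuation ratio) are hypothetical placeholders chosen for illustration, not features from the question; it also inherits from TransformerMixin, which supplies `fit_transform`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class StyleFeatures(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: two simple per-document style statistics."""

    def fit(self, X, y=None):
        # Nothing is learned here; fit exists to satisfy the estimator API.
        return self

    def transform(self, X):
        rows = []
        for doc in X:
            words = doc.split()
            # Average length of whitespace-separated tokens (0.0 for empty docs).
            avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0
            # Share of characters that are common punctuation marks.
            punct_ratio = sum(c in ".,;:!?" for c in doc) / len(doc) if doc else 0.0
            rows.append([avg_word_len, punct_ratio])
        # A dense NumPy array is something sklearn classifiers accept directly.
        return np.array(rows)


docs = ["Short words here.", "Considerably lengthier vocabulary, evidently!"]
features = StyleFeatures().fit_transform(docs)
print(features.shape)
```

Because the result is a plain array with one row per document, it can be passed to a classifier as is, or placed in a Pipeline next to other steps.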

Upvotes: 1

Tarantula

Reputation: 19862

You can easily extend the class (you can see the source of it here) and implement what you need. However, it depends on what you want to do, which is not very clear in your question.
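As one possible way to extend the class, the sketch below subclasses CountVectorizer and overrides `build_analyzer` so that every token is stemmed before counting. The subclass name `StemmedCountVectorizer` and the helper `toy_stem` are hypothetical; `toy_stem` is a crude suffix stripper standing in for a real stemmer, used only so the example has no extra dependencies.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer


def toy_stem(token):
    # Hypothetical stand-in for a real stemmer: strip a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


class StemmedCountVectorizer(CountVectorizer):
    """CountVectorizer whose analyzer stems each token it produces."""

    def build_analyzer(self):
        # Reuse the stock preprocessing, tokenizing, and n-gram logic.
        analyzer = super().build_analyzer()
        return lambda doc: [toy_stem(tok) for tok in analyzer(doc)]


docs = ["jumping jumped jumps", "the cat jumps"]
vec = StemmedCountVectorizer()
counts = vec.fit_transform(docs)

# The sparse count matrix still works with the existing tf-idf transformer.
tfidf = TfidfTransformer().fit_transform(counts)
print(sorted(vec.vocabulary_))
```

Since only the analyzer changes, the output is still a SciPy sparse count matrix, so the tf-idf transformer and the classifiers work on it unchanged.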

Upvotes: 1
