Lylia John

Reputation: 61

My PipeLine Configuration for Text Classification using sklearn in python

I have built the following pipeline:

max_features=None, min_df=2,ngram_range=(1, 3)

1- How can I print the output of this pipeline? I mean the 1-3 grams. And if I want to generate my bigrams myself, what is the best solution?

2- What if I want to add a constraint like min-TF > 1?

here is my configuration:

from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('count_vectorizer', TfidfVectorizer(max_features=None, min_df=2,
                                         ngram_range=(1, 3),
                                         token_pattern=r'\s', analyzer='word',
                                         lowercase=False,
                                         stop_words=StopWordsList)),
    ('tfidf_transformer', TfidfTransformer(norm='l2', smooth_idf=False,
                                           sublinear_tf=False, use_idf=True)),
    ('classifier', MultinomialNB())  # or SVC(kernel='rbf', probability=True)
])

Upvotes: 2

Views: 1005

Answers (1)

Guiem Bosch

Reputation: 2758

You can get specific elements from your pipeline via its named_steps attribute.

1. You can access your 'count_vectorizer' and print its idf_ attribute, which represents "the learned idf vector (global term weights)":

pipeline.named_steps['count_vectorizer'].idf_

1.1 And of course you can print the vocabulary, which gives you a dictionary mapping the n-grams to their column indices in the learned idf vector:

pipeline.named_steps['count_vectorizer'].vocabulary_
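To actually see the n-grams as readable text alongside their weights, you can invert vocabulary_ so the terms come out in column order, aligned with idf_. A minimal sketch on a toy corpus (the corpus and variable names here are illustrative, not from the question):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus; same ngram_range/min_df idea as in the question.
docs = ["the cat sat", "the cat ran", "the dog sat"]
vec = TfidfVectorizer(ngram_range=(1, 3), min_df=2)
vec.fit(docs)

# vocabulary_ maps each n-gram to its column index; sorting the terms
# by that index lines them up with the entries of idf_.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
for term, idf in zip(terms, vec.idf_):
    print(term, idf)
```

With min_df=2, only the n-grams appearing in at least two documents survive, so the loop prints a short, pruned list.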

1.2 I wouldn't generate a bigram by myself. You can change your pipeline params whenever you want using the pipeline's set_params function. In your case:

pipeline.set_params(count_vectorizer__ngram_range=(1,2))

Note how the params are constructed: count_vectorizer__ngram_range has the prefix count_vectorizer, which is the name you gave that element in your pipeline. It is followed by the __ marker, meaning that what comes next is the param name for that element; in this case you are setting ngram_range.
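If you're ever unsure of the exact prefixed spelling, the pipeline's get_params() lists every settable name. A small sketch (the step names here mirror the question's pipeline, but the components are simplified):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe = Pipeline([('count_vectorizer', TfidfVectorizer()),
                 ('classifier', MultinomialNB())])

# get_params() returns a dict of every settable parameter, including
# the step-prefixed ones accepted by set_params.
names = sorted(pipe.get_params().keys())
print([n for n in names if n.startswith('count_vectorizer__')])
```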

But if you meant to choose explicitly which words you want to count, that can be done through the vocabulary parameter. From the documentation: "vocabulary : Mapping or iterable, optional. Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents." So you could create something like {'awesome unicorns': 0, 'batman forever': 1} and it would only perform the tf-idf on your bigrams 'awesome unicorns' and 'batman forever' ;)
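As a quick check of that idea, here is a minimal sketch with a fixed vocabulary (the two documents are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Restrict counting to two hand-picked bigrams via the vocabulary param.
vocab = {'awesome unicorns': 0, 'batman forever': 1}
vec = TfidfVectorizer(ngram_range=(2, 2), vocabulary=vocab)
X = vec.fit_transform(["awesome unicorns are awesome",
                       "batman forever is a movie"])
print(X.shape)  # exactly one column per bigram in vocab
```

Every other bigram in the input is ignored; the feature matrix has only the two columns you declared.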

2. You could also add that constraint 'on the fly', the same way we did in 1.2: pipeline.set_params(count_vectorizer__min_df=2). Although I see you already set this variable in your TfidfVectorizer's initial params, so I'm not sure I understood your question here.

Do not forget to fit your pipeline on some data first, otherwise there won't be any vocabulary to print. For example, I loaded some 20newsgroups data to perform my tests and then fitted your pipeline. Just in case it is of some use to you:

from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups(subset='train', categories=['alt.atheism'])
pipeline.fit(data.data,data.target)
pipeline.named_steps['count_vectorizer'].idf_
pipeline.named_steps['count_vectorizer'].vocabulary_
pipeline.set_params(count_vectorizer__ngram_range=(1, 2)).fit(data.data,data.target)

Recommendation: if you'd like to try several possible configurations for your pipeline, you can always define a range of parameter values and get the best scores via a GridSearch. Here is a nice example: http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html
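The GridSearch idea can be sketched in a few lines. This is a minimal, self-contained version using GridSearchCV with a tiny made-up labelled corpus (the step names, parameter grid, and data are all illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy labelled corpus; in practice, pass your own documents and labels.
docs = ["good movie", "great film", "bad movie", "awful film"] * 3
labels = [1, 1, 0, 0] * 3

pipe = Pipeline([('vec', TfidfVectorizer()),
                 ('clf', MultinomialNB())])

# Parameter names use the same step__param convention as set_params.
params = {'vec__ngram_range': [(1, 1), (1, 2)],
          'clf__alpha': [0.1, 1.0]}

grid = GridSearchCV(pipe, params, cv=3)
grid.fit(docs, labels)
print(grid.best_params_)
```

GridSearchCV cross-validates every combination in the grid and keeps the best-scoring pipeline in grid.best_estimator_.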

Upvotes: 2
