Reputation: 61
I have built the following pipeline:
max_features=None, min_df=2, ngram_range=(1, 3)
1. How can I print the output of this pipeline? I mean the 1-3 grams. And if I want to generate my bigrams myself, what is the best solution?
2. What if I want to add a constraint like min-TF > 1?
here is my configuration:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# StopWordsList is my own list of stop words, defined elsewhere
pipeline = Pipeline([
    ('count_vectorizer', TfidfVectorizer(max_features=None, min_df=2, ngram_range=(1, 3),
                                         token_pattern=r'\s', analyzer='word',
                                         lowercase=False, stop_words=StopWordsList)),
    ('tfidf_transformer', TfidfTransformer(norm='l2', smooth_idf=False,
                                           sublinear_tf=False, use_idf=True)),
    ('classifier', MultinomialNB())  # or e.g. SVC(kernel='rbf', probability=True)
])
Upvotes: 2
Views: 1005
Reputation: 2758
You can get specific elements from your pipeline with named_steps.
1. You can access your 'count_vectorizer' and print its idf_ attribute, which holds "the learned idf vector (global term weights)":
pipeline.named_steps['count_vectorizer'].idf_
1.1 And of course you can print the vocabulary, which gives you a dictionary mapping each ngram to its column in the learned idf vector:
pipeline.named_steps['count_vectorizer'].vocabulary_
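Putting 1. and 1.1 together: since vocabulary_ maps each ngram to a column index and idf_ is indexed by that same column, you can print every learned ngram next to its weight. A minimal sketch with a hypothetical toy corpus (the corpus and variable names are just for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, only to illustrate the vocabulary_/idf_ mapping
docs = ["awesome unicorns are awesome",
        "batman forever is great",
        "awesome batman"]

vec = TfidfVectorizer(ngram_range=(1, 2))
vec.fit(docs)

# vocabulary_ maps ngram -> column index; idf_ is indexed by that column
for term, col in sorted(vec.vocabulary_.items(), key=lambda kv: kv[1]):
    print(f"{col:3d}  {term:<25s}  idf={vec.idf_[col]:.3f}")
```

This prints one line per learned 1-gram and 2-gram with its idf weight; the same loop works on pipeline.named_steps['count_vectorizer'] once the pipeline is fitted.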
1.2 I wouldn't generate bigrams by myself. You can change your pipeline's parameters whenever you want with its set_params function. In your case:
pipeline.set_params(count_vectorizer__ngram_range=(1, 2))
Note how the parameter name is constructed: count_vectorizer__ngram_range starts with the prefix count_vectorizer, the name you gave that element in your pipeline, followed by the __ marker, which means that what comes next is a parameter name for that element, in this case ngram_range.
But if you meant to choose explicitly which words you want to count, that can be done through the vocabulary parameter. From the documentation: "vocabulary : Mapping or iterable, optional. Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents." So you could pass something like {'awesome unicorns': 0, 'batman forever': 1} and it would only compute tf-idf for your bigrams 'awesome unicorns' and 'batman forever' ;)
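A minimal sketch of that fixed-vocabulary approach, using the same two example bigrams (the corpus is made up for illustration; note ngram_range still has to cover bigrams, otherwise they are never generated and all counts stay zero):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents containing the two bigrams we care about
docs = ["awesome unicorns and batman forever",
        "batman forever was awesome"]

# Only these two bigrams will become features, in these column positions
vec = TfidfVectorizer(ngram_range=(2, 2),
                      vocabulary={'awesome unicorns': 0, 'batman forever': 1})
X = vec.fit_transform(docs)

print(X.shape)  # 2 documents x 2 features: only the fixed bigrams
print(sorted(vec.vocabulary_, key=vec.vocabulary_.get))
```

Every other ngram in the documents is simply ignored, so the feature matrix has exactly as many columns as entries in your vocabulary dict.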
2. You could also add that constraint 'on the fly', the same way we did in 1.2:
pipeline.set_params(count_vectorizer__min_df=2)
Although I see you already set this variable in your TfidfVectorizer's initial params, so I'm not sure I understood your question here.
Do not forget to fit your pipeline on some data, otherwise there won't be any vocabulary to print. For example, I loaded some 20newsgroups data to perform my tests and then fitted your pipeline. Just in case it's of some use to you:
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups(subset='train', categories=['alt.atheism'])
pipeline.fit(data.data, data.target)
pipeline.named_steps['count_vectorizer'].idf_
pipeline.named_steps['count_vectorizer'].vocabulary_
pipeline.set_params(count_vectorizer__ngram_range=(1, 2)).fit(data.data, data.target)
Recommendation: if you'd like to try several possible configurations in your pipeline, you can always set a range of parameter values and get the best scores with a GridSearch. Here is a nice example: http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html
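A minimal sketch of that GridSearch idea, reusing the step__param naming convention from 1.2 (the corpus and labels are made up, and the pipeline is simplified to two steps so the example stays short and self-contained):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up two-class corpus, just so the search can run
docs = ["the cat sat on the mat", "the dog sat on the mat",
        "the cat chased the dog", "the dog chased the cat",
        "stocks rose on wall street", "stocks fell on wall street",
        "markets rose on strong earnings", "markets fell on weak earnings"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([
    ('count_vectorizer', TfidfVectorizer()),
    ('classifier', MultinomialNB()),
])

# Candidate values per parameter, keyed with the same
# step-name__param-name convention used by set_params
param_grid = {
    'count_vectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'count_vectorizer__min_df': [1, 2],
}

grid = GridSearchCV(pipeline, param_grid, cv=2)
grid.fit(docs, labels)
print(grid.best_params_)
print(grid.best_score_)
```

GridSearchCV fits the pipeline once per parameter combination and cross-validation fold, then exposes the winning combination in best_params_.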
Upvotes: 2