Reputation: 1973
I am trying to understand how scikit-learn pipelines work. As a test, I am trying to fit a Random Forest model to the iris data. Here is my code:
from sklearn import datasets
from sklearn import svm
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import sklearn.externals
import joblib
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset
iris = datasets.load_iris()
Divide data into train and test and create a pipeline with 2 steps
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
pipeline = Pipeline([('feature_selection', SelectKBest(chi2, k=2)), ('classification', RandomForestClassifier()) ])
print(type(pipeline))
(112, 4) (38, 4) (112,) (38,)
<class 'sklearn.pipeline.Pipeline'>
But when I execute pipeline.fit_transform(X_train, y_train), I get an error saying AttributeError: 'RandomForestClassifier' object has no attribute 'transform'
However, pipeline.fit(X_train, y_train) works fine.
Normally, without any pipeline code, I would take an ML model, apply fit_transform() to my training dataset, and then apply transform to my unseen dataset to generate predictions.
How do I set up something similar using pipelines in sklearn? I want to SAVE my pipeline and then perform scoring by LOADING it back. Can I do that using pickle?
Another thing is regarding the RF model itself. I can get a model summary using the RF model's methods, but I don't see any methods on my pipeline for printing a model summary.
Upvotes: 4
Views: 2011
Reputation: 12582
But when I execute pipeline.fit_transform(X_train, y_train), I get an error saying AttributeError: 'RandomForestClassifier' object has no attribute 'transform'
Indeed, RandomForestClassifier does not transform data because it is a model, not a transformer. A pipeline exposes either transform or predict (and its variants) depending on whether its last estimator is a transformer or a model.
So, generally, you'll want to call just pipeline.fit(X_train, y_train); then in testing or production you'll call pipeline.predict(X_test) (or predict_proba, or ...), which internally will transform with the first step(s) and predict with the last step. Note that predict takes only the features; if you want an accuracy figure against known labels, use pipeline.score(X_test, y_test) instead.
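As a minimal sketch of that fit-then-predict flow, reusing the pipeline from the question (the random_state=0 on the classifier is my addition, only to make runs repeatable):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)

pipeline = Pipeline([
    ("feature_selection", SelectKBest(chi2, k=2)),
    # random_state is an assumption added here for reproducibility
    ("classification", RandomForestClassifier(random_state=0)),
])

# fit() fits the selector, transforms X_train with it, then fits the forest
pipeline.fit(X_train, y_train)

# predict() applies the fitted selector to X_test, then the forest's predict
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)

# score() is the call that takes the true labels
accuracy = pipeline.score(X_test, y_test)
print(y_pred.shape, y_proba.shape, accuracy)
```

There is no separate transform call on the unseen data: the pipeline replays the fitted feature selection for you inside predict.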
How do I set up something similar using pipelines in sklearn? I want to SAVE my pipeline and then perform scoring by LOADING it back. Can I do that using pickle?
Yes; see sklearn Model Persistence for more details and recommendations.
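For illustration, here is a rough sketch of saving and reloading a fitted pipeline. The file name pipeline.joblib is hypothetical, and a temporary directory is used only so the example cleans up after itself; in practice you would write to a path of your choosing. joblib is shown alongside pickle since the question already imports it.

```python
import os
import pickle
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)

pipeline = Pipeline([
    ("feature_selection", SelectKBest(chi2, k=2)),
    ("classification", RandomForestClassifier(random_state=0)),
])
pipeline.fit(X_train, y_train)

# Option 1: joblib, writing the whole fitted pipeline to disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.joblib")  # hypothetical file name
    joblib.dump(pipeline, path)
    loaded = joblib.load(path)

# Option 2: plain pickle (shown in memory; pickle.dump/load work with files)
restored = pickle.loads(pickle.dumps(pipeline))

# The reloaded pipelines should score new data exactly like the original
same_joblib = (loaded.predict(X_test) == pipeline.predict(X_test)).all()
same_pickle = (restored.predict(X_test) == pipeline.predict(X_test)).all()
print(same_joblib, same_pickle)
```

Both approaches rely on pickle under the hood, so only load files you trust, and preferably with the same scikit-learn version that wrote them.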
Another thing is regarding the RF model itself. I can get a model summary using the RF model's methods, but I don't see any methods on my pipeline for printing a model summary.
You can access individual steps of a pipeline in a few ways (see sklearn Pipeline accessing steps):
pipeline.named_steps.classification
pipeline['classification']
pipeline[-1]
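Once you have the step, you can call whatever the underlying estimator offers. A small sketch, assuming the same pipeline as in the question; which attributes count as a "model summary" is my guess at what you are after:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)

pipeline = Pipeline([
    ("feature_selection", SelectKBest(chi2, k=2)),
    ("classification", RandomForestClassifier(random_state=0)),
])
pipeline.fit(X_train, y_train)

# Pull the fitted steps out of the pipeline by name
rf = pipeline.named_steps["classification"]
selector = pipeline["feature_selection"]

# Which of the four iris features survived SelectKBest(k=2)
mask = selector.get_support()
selected = [name for name, keep in zip(iris.feature_names, mask) if keep]

# Hyperparameters and per-feature importances of the fitted forest;
# importances line up with the selected features, not the original four
print(rf.get_params())
for name, importance in zip(selected, rf.feature_importances_):
    print(name, importance)
```

Keep in mind that the forest only ever sees the selector's output, so anything it reports per feature refers to the two selected columns.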
Upvotes: 2