dsmlaws

Reputation: 67

PySpark: model interpretation from a pipeline model

I am implementing a DecisionTreeClassifier in pyspark using the Pipeline module, as I have several feature engineering steps to perform on my dataset. The code is similar to the example from the Spark documentation:

from pyspark import SparkContext, SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

sc = SparkContext(appName="decision_tree_pipeline")
sqlContext = SQLContext(sc)

# Load the data stored in LIBSVM format as a DataFrame.
data = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# We specify maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="precision")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))

treeModel = model.stages[2]
# summary only
print(treeModel)

The question is: how do I perform model interpretation on this? The PipelineModel object does not have a toDebugString() method like the model returned by DecisionTree.trainClassifier(), and I cannot use DecisionTree.trainClassifier() in my pipeline because trainClassifier() takes the training data as a parameter.

The pipeline, on the other hand, takes the training data as an argument to fit() and the test data in transform().
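For reference, this is the MLlib call I mean; a minimal sketch (assuming trainingRDD is an RDD of LabeledPoint, which is not part of the pipeline code above):

from pyspark.mllib.tree import DecisionTree

# The MLlib API takes the training data directly and returns a DecisionTreeModel,
# which exposes toDebugString()
mllib_model = DecisionTree.trainClassifier(trainingRDD, numClasses=2,
                                           categoricalFeaturesInfo={})
print(mllib_model.toDebugString())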

Is there a way to use the pipeline and still perform model interpretation and find feature importances?

Upvotes: 3

Views: 3749

Answers (1)

ffmmmm

Reputation: 41

Yes, I have used the approach below in almost all of my model interpretations in pyspark. The snippet below uses the naming conventions from your code excerpt.

dtm = model.stages[-1]  # your estimator is the last stage in the pipeline,
# hence the fitted DecisionTreeClassificationModel is the last transformer in the PipelineModel object
print(dtm.explainParams())

Now you have access to all the methods and attributes of the DecisionTreeClassificationModel; they are listed in its API documentation. Code was not tested on your example.
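In particular, for the tree structure and the feature importances you asked about, something along these lines should work in recent Spark versions (same model object; again, not tested on your data):

dtm = model.stages[-1]          # fitted DecisionTreeClassificationModel
print(dtm.toDebugString)        # full text representation of the learned tree (a property, unlike MLlib's method)
print(dtm.featureImportances)   # SparseVector of feature importances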

Upvotes: 4
