How to print best model params in pyspark pipeline

Question

This question is similar to this one. I would like to print the best model params after doing a TrainValidationSplit in pyspark. I cannot find the piece of text the other user uses to answer the question because I'm working on jupyter and the log dissapears from the terminal...

Part of the code is:

pca = PCA(inputCol = 'features')
dt = DecisionTreeRegressor(featuresCol=pca.getOutputCol(), 
                           labelCol="energy")
pipe = Pipeline(stages=[pca,dt])

paramgrid = ParamGridBuilder().addGrid(pca.k, range(1,50,2)).addGrid(dt.maxDepth, range(1,10,1)).build()

tvs = TrainValidationSplit(estimator = pipe, evaluator = RegressionEvaluator(
labelCol="energy", predictionCol="prediction", metricName="mae"), estimatorParamMaps = paramgrid, trainRatio = 0.66)

model = tvs.fit(wind_tr_va);

Thanks in advance.

eliasah · Accepted Answer

It follows indeed the same reasoning described in the answer about How to get the maxDepth from a Spark RandomForestRegressionModel given by @user6910411.

You'll need to patch the TrainValidationSplitModel, PCAModel and DecisionTreeRegressionModel as followed :

TrainValidationSplitModel.bestModel = (
    lambda self: self._java_obj.bestModel
)

PCAModel.getK = (
    lambda self: self._java_obj.getK()
)

DecisionTreeRegressionModel.getMaxDepth = (
    lambda self: self._java_obj.getMaxDepth()
)

Now you can use it to get the best model and extract k and maxDepth

bestModel = model.bestModel

bestModelK = bestModel.stages[0].getK()
bestModelMaxDepth = bestModel.stages[1].getMaxDepth()

PS: You can patch models to get specific parameters the same way described above.

How to print best model params in pyspark pipeline

Answers (2)

Related Questions