Reputation: 1045
I want to find the parameters from ParamGridBuilder that produce the best model in CrossValidator in Spark 1.4.x.
In the Pipeline example in the Spark documentation, they add different parameters (numFeatures, regParam) by using ParamGridBuilder in the Pipeline. Then they fit the best model with the following line of code:
val cvModel = crossval.fit(training.toDF)
Now I want to know which parameter values (numFeatures, regParam) from ParamGridBuilder produce the best model.
I already used the following commands without success:
cvModel.bestModel.extractParamMap().toString()
cvModel.params.toList.mkString("(", ",", ")")
cvModel.estimatorParamMaps.toString()
cvModel.explainParams()
cvModel.getEstimatorParamMaps.mkString("(", ",", ")")
cvModel.toString()
Any help?
Thanks in advance,
Upvotes: 29
Views: 19760
Reputation: 1
For me, the @orangeHIX solution is perfect:
val cvModel = cv.fit(training)
val cvMejorModelo = cvModel.bestModel.asInstanceOf[ALSModel]
cvMejorModelo.parent.extractParamMap()
res86: org.apache.spark.ml.param.ParamMap =
{
als_08eb64db650d-alpha: 0.05,
als_08eb64db650d-checkpointInterval: 10,
als_08eb64db650d-coldStartStrategy: drop,
als_08eb64db650d-finalStorageLevel: MEMORY_AND_DISK,
als_08eb64db650d-implicitPrefs: false,
als_08eb64db650d-intermediateStorageLevel: MEMORY_AND_DISK,
als_08eb64db650d-itemCol: product,
als_08eb64db650d-maxIter: 10,
als_08eb64db650d-nonnegative: false,
als_08eb64db650d-numItemBlocks: 10,
als_08eb64db650d-numUserBlocks: 10,
als_08eb64db650d-predictionCol: prediction,
als_08eb64db650d-rank: 1,
als_08eb64db650d-ratingCol: rating,
als_08eb64db650d-regParam: 0.1,
als_08eb64db650d-seed: 1994790107,
als_08eb64db650d-userCol: user
}
Upvotes: 0
Reputation: 938
This SO thread kinda answers the question.
In a nutshell, you need to cast each object to its supposed-to-be class.
For the case of CrossValidatorModel, the following is what I did:
import org.apache.spark.ml.tuning.CrossValidatorModel
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.regression.RandomForestRegressionModel
// Load CV model from S3
val inputModelPath = "s3://path/to/my/random-forest-regression-cv"
val reloadedCvModel = CrossValidatorModel.load(inputModelPath)
// To get the parameters of the best model
(
reloadedCvModel.bestModel
.asInstanceOf[PipelineModel]
.stages(1)
.asInstanceOf[RandomForestRegressionModel]
.extractParamMap()
)
In the example, my pipeline has two stages (a VectorIndexer and a RandomForestRegressor), so the stage index is 1 for my model.
Upvotes: 1
Reputation: 1334
To print everything in paramMap, you actually don't have to call parent:
cvModel.bestModel.extractParamMap()
To answer OP's question, to get a single best parameter, for example regParam:
cvModel.bestModel.extractParamMap().apply(cvModel.bestModel.getParam("regParam"))
Upvotes: 4
Reputation: 804
Building on the solution of @macfeliga, a one-liner that works for pipelines:
cvModel.bestModel.asInstanceOf[PipelineModel]
.stages.foreach(stage => println(stage.extractParamMap))
Upvotes: 0
Reputation: 1
I am working with Spark Scala 1.6.x, and here is a full example of how I set up and fit a CrossValidator and then return the value of the parameter used to get the best model (assuming that training.toDF gives a DataFrame ready to be used):
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
// Instantiate a LogisticRegression object
val lr = new LogisticRegression()
// Build a parameter grid with different values for the 'regParam' parameter of the logistic regression
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.0001, 0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1)).build()
// Set up and fit the CrossValidator on the training set, using 'MulticlassClassificationEvaluator' as evaluator
val crossVal = new CrossValidator().setEstimator(lr).setEvaluator(new MulticlassClassificationEvaluator).setEstimatorParamMaps(paramGrid)
val cvModel = crossVal.fit(training.toDF)
// Getting the value of the 'regParam' used to get the best model
val bestModel = cvModel.bestModel // Getting the best model
val paramReference = bestModel.getParam("regParam") // Getting the reference of the parameter you want (only the reference, not the value)
val paramValue = bestModel.getOrDefault(paramReference) // Getting the value of this parameter (get(...) would return an Option instead)
print(paramValue) // In my case : 0.001
You can do the same for any parameter or any other type of model.
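As a rough sketch of "the same for any parameter", the lookup can be wrapped in a small helper. The names ParamLookupSketch and paramValue are hypothetical (they are not part of Spark); the helper only relies on hasParam, getParam, get and getDefault from Spark's Params trait, and it assumes the Spark ML jars are on the classpath. No SparkContext is needed just to read parameters.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.Params

object ParamLookupSketch {
  // Hypothetical helper, not a Spark API: looks a parameter up by name on any
  // Params instance (estimator, model, or pipeline stage).
  // Returns the explicitly set value, else the default, else None.
  def paramValue(stage: Params, name: String): Option[Any] =
    if (stage.hasParam(name)) {
      val param = stage.getParam(name) // reference only, typed as Param[Any]
      stage.get(param).orElse(stage.getDefault(param))
    } else {
      None // this stage has no parameter with that name
    }

  def main(args: Array[String]): Unit = {
    // An unfitted estimator is enough to demonstrate the lookup.
    val lr = new LogisticRegression().setRegParam(0.001)
    println(paramValue(lr, "regParam"))
    println(paramValue(lr, "noSuchParam"))
  }
}
```

With a fitted CrossValidatorModel, the same call would presumably look like ParamLookupSketch.paramValue(cvModel.bestModel, "regParam"). Returning an Option avoids the NoSuchElementException that getParam throws for an unknown name.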
Upvotes: 0
Reputation: 6321
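This is how you get the chosen parameters. In Scala, bestModel is typed as Model[_], so cast it to the concrete model class first (LogisticRegressionModel is assumed here):
val best = cvModel.bestModel.asInstanceOf[LogisticRegressionModel]
println(best.getMaxIter)
println(best.getRegParam)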
Upvotes: 3
Reputation: 31
This Java code should work:
cvModel.bestModel().parent().extractParamMap()
You can translate it to Scala code. The parent() method returns the estimator, from which you can get the best params.
Upvotes: 2
Reputation: 41
This is the ParamGridBuilder():
paraGrid = ParamGridBuilder().addGrid(
hashingTF.numFeatures, [10, 100, 1000]
).addGrid(
lr.regParam, [0.1, 0.01, 0.001]
).build()
There are 3 stages in the pipeline. It seems we can access the parameters as follows:
for stage in cv_model.bestModel.stages:
    print('stage: {}'.format(stage))
    print(stage.params)
    print('\n')
stage: Tokenizer_46ffb9fac5968c6c152b
[Param(parent='Tokenizer_46ffb9fac5968c6c152b', name='inputCol', doc='input column name'), Param(parent='Tokenizer_46ffb9fac5968c6c152b', name='outputCol', doc='output column name')]
stage: HashingTF_40e1af3ba73764848d43
[Param(parent='HashingTF_40e1af3ba73764848d43', name='inputCol', doc='input column name'), Param(parent='HashingTF_40e1af3ba73764848d43', name='numFeatures', doc='number of features'), Param(parent='HashingTF_40e1af3ba73764848d43', name='outputCol', doc='output column name')]
stage: LogisticRegression_451b8c8dbef84ecab7a9
[]
However, no parameters are listed for the last stage, LogisticRegression.
We can also get numFeatures from the HashingTF stage, and the intercept and weights from the logistic regression stage, as follows:
cv_model.bestModel.stages[1].getNumFeatures()
10
cv_model.bestModel.stages[2].intercept
1.5791827733883774
cv_model.bestModel.stages[2].weights
DenseVector([-2.5361, -0.9541, 0.4124, 4.2108, 4.4707, 4.9451, -0.3045, 5.4348, -0.1977, -1.8361])
Full exploration: http://kuanliang.github.io/2016-06-07-SparkML-pipeline/
Upvotes: 1
Reputation: 301
One method to get a proper ParamMap object is to use CrossValidatorModel.avgMetrics: Array[Double] to find the argmax ParamMap:
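import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.CrossValidatorModel

implicit class BestParamMapCrossValidatorModel(cvModel: CrossValidatorModel) {
  // Pair each candidate ParamMap with its average metric and keep the highest-scoring one
  def bestEstimatorParamMap: ParamMap = {
    cvModel.getEstimatorParamMaps
      .zip(cvModel.avgMetrics)
      .maxBy(_._2)
      ._1
  }
}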
When run on the CrossValidatorModel trained in the Pipeline Example you cited, this gives:
scala> println(cvModel.bestEstimatorParamMap)
{
hashingTF_2b0b8ccaeeec-numFeatures: 100,
logreg_950a13184247-regParam: 0.1
}
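One caveat: maxBy assumes a larger metric is better, which holds for accuracy-style metrics but not for error metrics such as RMSE. Below is a minimal plain-Scala sketch of the same zip-and-argmax idea with the direction made explicit. It uses a plain Map as a stand-in for ParamMap so that it runs without Spark; BestParamsSketch, bestParams and the metric values are made up for illustration. In Spark versions where Evaluator exposes isLargerBetter, cvModel.getEvaluator.isLargerBetter could presumably supply the flag.

```scala
object BestParamsSketch {
  // Stand-in for Spark's ParamMap (hypothetical): parameter name -> value.
  type Params = Map[String, Any]

  // Pairs each candidate with its average metric and returns the best pair.
  // largerIsBetter = false selects the smallest metric (e.g. for RMSE).
  def bestParams(
      candidates: Seq[Params],
      avgMetrics: Seq[Double],
      largerIsBetter: Boolean): (Params, Double) = {
    require(candidates.nonEmpty, "need at least one candidate")
    require(candidates.length == avgMetrics.length, "one metric per candidate")
    val scored = candidates.zip(avgMetrics)
    if (largerIsBetter) scored.maxBy(_._2) else scored.minBy(_._2)
  }

  def main(args: Array[String]): Unit = {
    val grid: Seq[Params] = Seq(
      Map("numFeatures" -> 10, "regParam" -> 0.1),
      Map("numFeatures" -> 100, "regParam" -> 0.1),
      Map("numFeatures" -> 1000, "regParam" -> 0.01)
    )
    val metrics = Seq(0.71, 0.93, 0.88) // made-up values for illustration
    val (best, score) = bestParams(grid, metrics, largerIsBetter = true)
    println(s"best = $best, metric = $score")
  }
}
```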
Upvotes: 20
Reputation: 159
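import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.feature.HashingTF

// The best model is the fitted pipeline; cast each stage to its concrete class
val bestPipelineModel = cvModel.bestModel.asInstanceOf[PipelineModel]
val stages = bestPipelineModel.stages
val hashingStage = stages(1).asInstanceOf[HashingTF]
println("numFeatures = " + hashingStage.getNumFeatures)
val lrStage = stages(2).asInstanceOf[LogisticRegressionModel]
println("regParam = " + lrStage.getRegParam)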
Upvotes: 13