Reputation: 17674
I am building a custom estimator for Spark. Unfortunately, there seems to be something wrong with how I access the Param[T] default params of the estimator. Here is a minimal example which compares a Transformer with an Estimator. Both have access to the same parameter trait:
trait PreprocessingParam2s extends Params {
  final val isInList = new Param[Array[String]](this, "isInList", "list of isInList items")
}
The Estimator is called like
new ExampleEstimator().setIsInList(Array("def", "ABC")).fit(dates).transform(dates).show
In order to perform
dataset
  .withColumn("isInList", when('ISO isin ($(isInList): _*), 1).otherwise(0))
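Here setIsInList is the standard param setter; a minimal sketch (assuming it is defined on the estimator as in the linked gist):
def setIsInList(value: Array[String]): this.type = {
  set(isInList, value)
  this
}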
But unlike the Transformer, which works fine, the Estimator fails with:
java.util.NoSuchElementException: Failed to find a default value for isInList
See https://gist.github.com/geoHeil/218683c6b0f91bc76f71cb652cd746b8 or https://github.com/geoHeil/sparkEstimatorProblem (which includes a build.sbt file to reproduce the problem more easily).
What is wrong here?
I am using Spark 2.0.2.
As @t-gawęda points out, the error can be fixed by setting default parameters. But this should not be necessary, as I call new ExampleEstimator().setIsInList(Array("def", "ABC")). So why is the parameter not set?
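Indeed, the standard Params API confirms that the value is set on the estimator (a small sketch using ExampleEstimator from the example):
val est = new ExampleEstimator().setIsInList(Array("def", "ABC"))
println(est.isSet(est.isInList)) // true: the value was set explicitly
println(est.explainParams())     // lists every param with its current value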
If I set default parameters, they are used as a fallback. But these are not the semantics I want. Instead of the correct output (of the Transformer):
+----------+---+--------+
| dates|ISO|isInList|
+----------+---+--------+
|2016-01-01|ABC| 1|
|2016-01-02|ABC| 1|
|2016-01-03|POL| 0|
|2016-01-04|ABC| 1|
|2016-01-05|POL| 0|
|2016-01-06|ABC| 1|
|2016-01-07|POL| 0|
|2016-01-08|ABC| 1|
|2016-01-09|def| 1|
|2016-01-10|ABC| 1|
+----------+---+--------+
+--------+
|isInList|
+--------+
| 1|
| 0|
+--------+
I get
+----------+---+--------+
| dates|ISO|isInList|
+----------+---+--------+
|2016-01-01|ABC| 0|
|2016-01-02|ABC| 0|
|2016-01-03|POL| 0|
|2016-01-04|ABC| 0|
|2016-01-05|POL| 0|
|2016-01-06|ABC| 0|
|2016-01-07|POL| 0|
|2016-01-08|ABC| 0|
|2016-01-09|def| 0|
|2016-01-10|ABC| 0|
+----------+---+--------+
+--------+
|isInList|
+--------+
| 0|
+--------+
Clearly, only the (wrong) default parameters were used. What is wrong with my approach to storing the parameters? See https://github.com/geoHeil/sparkEstimatorProblem for a working example which sets default parameters.
Upvotes: 1
Views: 1441
Reputation: 16076
Try adding a getter and setDefault in your parameter trait:
trait PreprocessingParam2s extends Params {
  final val isInList = new Param[Array[String]](this, "isInList", "list of isInList items")

  setDefault(isInList, /* here put default value */)

  /** @group getParam */
  final def getIsInList: Array[String] = $(isInList)
}
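For example, a minimal default could be an empty list (an assumption; any Array[String] value works):
setDefault(isInList, Array.empty[String])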
Why?
Look at how the params map is created:
lazy val params: Array[Param[_]] = {
  val methods = this.getClass.getMethods
  methods.filter { m =>
      Modifier.isPublic(m.getModifiers) &&
        classOf[Param[_]].isAssignableFrom(m.getReturnType) &&
        m.getParameterTypes.isEmpty
    }.sortBy(_.getName)
    .map(m => m.invoke(this).asInstanceOf[Param[_]])
}
Spark scans all methods in the class and picks out those that are public, take no parameters, and have a return type assignable to Param[_] (exactly the filter above). Spark invokes these methods and collects the resulting Param objects into the params array. When you read a parameter value, Spark looks it up in the param map instead of executing getters.
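To illustrate the lookup behaviour (a sketch using ExampleEstimator from the question, without a default set):
val est = new ExampleEstimator()
// The reflective scan above discovers the param itself:
est.params.foreach(p => println(p.name)) // prints: isInList
// But reading the value consults the param maps, not a getter:
println(est.isSet(est.isInList)) // false: nothing set yet
// est.getOrDefault(est.isInList) // would throw: Failed to find a default value for isInList
est.setIsInList(Array("ABC"))
println(est.getOrDefault(est.isInList).mkString(",")) // prints: ABC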
The full code is here. Explanation: in the model you must have:
def setIsInList(value: Array[String]): this.type = {
  set(isInList, value)
  this
}
and in the fit method of the estimator:
override def fit(dataset: Dataset[_]): ExampleTransModel = new ExampleTransModel(uid, 1.0).setIsInList($(this.isInList))
After creating the model, you were not copying the value of the parameter to the model object. That's why it was always empty when evaluating the model.
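Alternatively (a sketch, assuming ExampleTransModel mixes in the same PreprocessingParam2s trait), Spark's built-in copyValues helper forwards all params set on the estimator to the model in one call:
override def fit(dataset: Dataset[_]): ExampleTransModel = {
  val model = new ExampleTransModel(uid, 1.0)
  // copyValues copies every param that is set (or has a default) on `this`:
  copyValues(model.setParent(this))
}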
Upvotes: 4