Georg Heiler

Spark custom estimator access to Param[T]

I am building a custom estimator for Spark. Unfortunately, there seems to be something wrong with how I access the Param[T] default params of the estimator. Here is a minimal example which compares a Transformer with an Estimator. The Estimator, which has access to the same parameter trait

trait PreprocessingParam2s extends Params {
  final val isInList = new Param[Array[String]](this, "isInList", "list of isInList items")
}

is called like

new ExampleEstimator().setIsInList(Array("def", "ABC")).fit(dates).transform(dates).show

In order to perform

dataset
  .withColumn("isInList", when('ISO isin ($(isInList): _*), 1).otherwise(0))

But unlike the Transformer, which works fine, the Estimator fails with java.util.NoSuchElementException: Failed to find a default value for isInList.

See https://gist.github.com/geoHeil/218683c6b0f91bc76f71cb652cd746b8 or https://github.com/geoHeil/sparkEstimatorProblem (which includes a build.sbt file to reproduce the problem more easily).
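For readers who do not want to follow the links, this is roughly the shape of the estimator and the model (a condensed sketch reconstructed from the snippets above and the linked repository; the full runnable code lives in the links):

import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.{Param, ParamMap, Params}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.StructType

class ExampleEstimator(override val uid: String)
    extends Estimator[ExampleTransModel] with PreprocessingParam2s {

  def this() = this(Identifiable.randomUID("exampleEstimator"))

  def setIsInList(value: Array[String]): this.type = set(isInList, value)

  // The param set on the estimator is not handed over to the model here.
  override def fit(dataset: Dataset[_]): ExampleTransModel =
    new ExampleTransModel(uid, 1.0)

  override def transformSchema(schema: StructType): StructType = schema
  override def copy(extra: ParamMap): ExampleEstimator = defaultCopy(extra)
}

class ExampleTransModel(override val uid: String, val someValue: Double)
    extends Model[ExampleTransModel] with PreprocessingParam2s {

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn("isInList",
      when(col("ISO").isin($(isInList): _*), 1).otherwise(0))

  override def transformSchema(schema: StructType): StructType = schema
  override def copy(extra: ParamMap): ExampleTransModel =
    copyValues(new ExampleTransModel(uid, someValue), extra)
}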

What is wrong here?

edit

I am using Spark 2.0.2.

As @T. Gawęda points out, the error can be fixed by setting default parameters. But this should not be necessary, as I call new ExampleEstimator().setIsInList(Array("def", "ABC")). So why is the parameter not set?
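Digging a bit further: the exception message matches what Spark's Params.getOrDefault throws. $(param) is shorthand for getOrDefault(param), which checks an explicitly set value first and only then falls back to the default (paraphrased from the Spark 2.0.x source):

// Paraphrased from org.apache.spark.ml.param.Params (Spark 2.0.x):
final def getOrDefault[T](param: Param[T]): T = {
  shouldOwn(param)
  get(param).orElse(getDefault(param)).getOrElse(
    throw new NoSuchElementException(s"Failed to find a default value for ${param.name}"))
}

So the exception means that neither an explicit value nor a default is present on the object whose $(isInList) is evaluated.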

If I set default parameters, they are used as a fallback. But this is not the semantics I want to achieve. Instead of the correct output of the Transformer:

+----------+---+--------+
|     dates|ISO|isInList|
+----------+---+--------+
|2016-01-01|ABC|       1|
|2016-01-02|ABC|       1|
|2016-01-03|POL|       0|
|2016-01-04|ABC|       1|
|2016-01-05|POL|       0|
|2016-01-06|ABC|       1|
|2016-01-07|POL|       0|
|2016-01-08|ABC|       1|
|2016-01-09|def|       1|
|2016-01-10|ABC|       1|
+----------+---+--------+

+--------+                                                                      
|isInList|
+--------+
|       1|
|       0|
+--------+

I get:

+----------+---+--------+
|     dates|ISO|isInList|
+----------+---+--------+
|2016-01-01|ABC|       0|
|2016-01-02|ABC|       0|
|2016-01-03|POL|       0|
|2016-01-04|ABC|       0|
|2016-01-05|POL|       0|
|2016-01-06|ABC|       0|
|2016-01-07|POL|       0|
|2016-01-08|ABC|       0|
|2016-01-09|def|       0|
|2016-01-10|ABC|       0|
+----------+---+--------+

+--------+
|isInList|
+--------+
|       0|
+--------+

Clearly the wrong values, i.e. only the default parameters, were used. What is wrong with my approach of storing the parameters? See https://github.com/geoHeil/sparkEstimatorProblem for a working example which sets default parameters.


Answers (1)

T. Gawęda

Try adding a getter and a setDefault call for your parameter:

trait PreprocessingParam2s extends Params {
  final val isInList = new Param[Array[String]](this, "isInList", "list of isInList items")

  setDefault(isInList, /* here put default value */)

  /** @group getParam */
  final def getIsInList: Array[String] = $(isInList)
}

Why?

Look at how the params map is created:

lazy val params: Array[Param[_]] = {
    val methods = this.getClass.getMethods
    methods.filter { m =>
        Modifier.isPublic(m.getModifiers) &&
          classOf[Param[_]].isAssignableFrom(m.getReturnType) &&
          m.getParameterTypes.isEmpty
      }.sortBy(_.getName)
      .map(m => m.invoke(this).asInstanceOf[Param[_]])
}

Spark scans all methods of the class and keeps the methods with:

  • a public modifier
  • no parameters
  • a return type that is a subtype of Param

Spark invokes these methods and puts the returned Param objects into the params map. When you read a parameter value, Spark looks it up in this map instead of executing the getters.
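To illustrate (a minimal sketch with a hypothetical ScanDemo class, not taken from the question's code):

import org.apache.spark.ml.param.{Param, ParamMap, Params}
import org.apache.spark.ml.util.Identifiable

// Hypothetical class, only to show what the reflection scan picks up.
class ScanDemo(override val uid: String) extends Params {

  def this() = this(Identifiable.randomUID("scanDemo"))

  // Discovered: the accessor generated for this val is public, takes no
  // arguments, and returns a subtype of Param[_].
  final val isInList = new Param[Array[String]](this, "isInList", "demo param")

  // Not discovered: this method takes an argument, so the filter rejects it.
  def paramByName(name: String): Param[_] = isInList

  override def copy(extra: ParamMap): Params = defaultCopy(extra)
}

// new ScanDemo().params.map(_.name) yields Array("isInList")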

Full code is here. Explanation: in the model you must have:

def setIsInList(value: Array[String]): this.type = {
  set(isInList, value)
  this
}

and in the fit method of the estimator:

override def fit(dataset: Dataset[_]): ExampleTransModel = new ExampleTransModel(uid, 1.0).setIsInList($(this.isInList))

After creating the model, you were not copying the value of the parameter to the model object - that's why it was never set when evaluating the model.
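As a side note (a sketch, not from the linked code): instead of setting each parameter by hand, the protected Params.copyValues helper, which Spark's built-in estimators use for exactly this, copies all shared param values from the estimator to the model at once:

override def fit(dataset: Dataset[_]): ExampleTransModel =
  copyValues(new ExampleTransModel(uid, 1.0).setParent(this))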
