Spark ML insert/fit custom OneHotEncoder into a Pipeline

Question

Say I have a few features/columns in a dataframe on which I apply the regular OneHotEncoder, and one (let, n-th) column on which I need to apply my custom OneHotEncoder. Then I need to use VectorAssembler to assemble those features, and put into a Pipeline, finally fitting my trainData and getting predictions from my testData, such as:

val sIndexer1 = new StringIndexer().setInputCol("my_feature1").setOutputCol("indexed_feature1")
// ... let, n-1 such sIndexers for n-1 features
val featureEncoder = new OneHotEncoderEstimator().setInputCols(Array(sIndexer1.getOutputCol), ...).
      setOutputCols(Array("encoded_feature1", ... ))

// **need to insert output from my custom OneHotEncoder function (please see below)**
// (which takes the n-th feature as input) in a way that matches the VectorAssembler below

val vectorAssembler = new VectorAssembler().setInputCols(featureEncoder.getOutputCols + ???).
      setOutputCol("assembled_features")

...

val pipeline = new Pipeline().setStages(Array(sIndexer1, ...,featureEncoder, vectorAssembler, myClassifier))
val model = pipeline.fit(trainData)
val predictions = model.transform(testData)

How can I modify the building of the vectorAssembler so that it can ingest the output from the custom OneHotEncoder? The problem is my desired oheEncodingTopN() cannot/should not refer to the "actual" dataframe, since it would be a part of the pipeline (to apply on trainData/testData).

Note:

I tested that the custom OneHotEncoder (see link) works just as expected separately on e.g. trainData. Basically, oheEncodingTopN applies OneHotEncoding on the input column, but for the top N frequent values only (e.g. N = 50), and put all the rest infrequent values in a dummy column (say, "default"), e.g.:

val oheEncoded = oheEncodingTopN(df, "my_featureN", 50)

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}
import org.apache.spark.sql.Column


def flip(col: Column): Column = when(col === 1, lit(0)).otherwise(lit(1))

def oheEncodingTopN(df: DataFrame, colName: String, n: Int): DataFrame = {
  df.createOrReplaceTempView("data")
  val topNDF = spark.sql(s"select $colName, count(*) as count from data group by $colName order by count desc limit $n")

  val pivotTopNDF = topNDF.
    groupBy(colName).
    pivot(colName).
    count().
    withColumn("default", lit(1))

  val joinedTopNDF = df.join(pivotTopNDF, Seq(colName), "left").drop(colName)

  val oheEncodedDF = joinedTopNDF.
    na.fill(0, joinedTopNDF.columns).
    withColumn("default", flip(col("default")))

   oheEncodedDF

}

Spark ML insert/fit custom OneHotEncoder into a Pipeline

Answers (1)

Related Questions