Training ML models on Spark per partition, so that there is one trained model per partition of the DataFrame

Question

How can I train models in parallel, one per partition, in Spark using Scala? The solution given here is in PySpark; I'm looking for a solution in Scala. How can you efficiently build one ML model per partition in Spark with foreachPartition?

Som · Accepted Answer

Get the distinct partition values using the partition column.
Create a thread pool of, say, 100 threads.
Create a Future for each partition value and run it on the thread pool.
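
The thread pool and the futures in the steps above can be sketched with the Scala standard library alone. This is only an illustration under assumptions: `ThreadPoolSketch`, `trainAll`, and this particular body of `getExecutionContext` are hypothetical, and `train` stands in for whatever per-partition training you actually run.

```scala
import java.util.concurrent.{Executors, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}
import scala.concurrent.duration.Duration

object ThreadPoolSketch {
  // Hypothetical helper: a fixed-size pool of named daemon threads,
  // wrapped so it can be passed to Future as an ExecutionContext.
  def getExecutionContext(name: String, maxThreads: Int): ExecutionContextExecutorService = {
    val counter = new AtomicInteger(0)
    val factory = new ThreadFactory {
      override def newThread(r: Runnable): Thread = {
        val t = new Thread(r, s"$name-${counter.incrementAndGet()}")
        t.setDaemon(true)
        t
      }
    }
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(maxThreads, factory))
  }

  // Runs `train` once per partition value on the pool and blocks
  // until every result is available. Shuts the pool down afterwards.
  def trainAll(partitionValues: List[String], maxThreads: Int)
              (train: String => Double): Map[String, Double] = {
    val ec = getExecutionContext("per-partition-training", maxThreads)
    try {
      val futures = partitionValues.map(v => Future(v -> train(v))(ec))
      futures.map(Await.result(_, Duration.Inf)).toMap
    } finally {
      ec.shutdown()
    }
  }
}
```

For example, `ThreadPoolSketch.trainAll(List("a", "bb"), 2)(_.length.toDouble)` returns a map from each value to the metric produced for it. In a real job `train` would filter the DataFrame and fit a model; the threads live on the driver and only submit Spark jobs concurrently, so the pool size bounds the number of jobs in flight rather than the amount of cluster work.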

Sample code may be as follows:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration
    import org.apache.spark.sql.functions.col

    // Get an ExecutionContext backed by a thread pool (getExecutionContext is not defined here);
    // check https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/HasParallelism.scala#L50
    val threadPoolExecutorService = getExecutionContext("name", 100)

    val uniquePartitionValues: List[String] = ... // getDistinctPartitionsUsingPartitionCol

    // Asynchronous invocation of training. The results will be collected from the futures.
    val uniquePartitionValuesFutures = uniquePartitionValues.map(partitionValue => {
      Future[Double] {
        try {
          // get the dataframe where partitionCol = partitionValue
          // (a Column comparison avoids quoting problems with string values)
          val partitionDF = mainDF.where(col("partitionCol") === partitionValue)
          // do preprocessing and training using any algo with partitionDF as input, and return the accuracy
        } catch {
          ....
        }
      }(threadPoolExecutorService)
    })

    // Wait for metrics to be calculated
    val foldMetrics = uniquePartitionValuesFutures.map(Await.result(_, Duration.Inf))
    println(s"output::${foldMetrics.mkString("  ###  ")}")
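
The two parts left open in the sample (collecting the distinct partition values and training on one filtered DataFrame) might look roughly like this. It is a sketch under assumptions, not the answer's actual code: `PerPartitionTraining`, `distinctPartitionValues`, and `trainOne` are hypothetical names, the choice of `LogisticRegression` is arbitrary, and the input is assumed to already have a `features` vector column and a numeric `label` column.

```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object PerPartitionTraining {
  // Distinct non-null values of the partition column, collected to the driver.
  // Assumes the column can be cast to string and that the number of
  // distinct values is small enough to hold in driver memory.
  def distinctPartitionValues(df: DataFrame, partitionCol: String): List[String] =
    df.where(col(partitionCol).isNotNull)
      .select(col(partitionCol).cast("string"))
      .distinct()
      .collect()
      .map(_.getString(0))
      .toList

  // Fits one model on the rows that belong to a single partition value.
  // Assumes "features" (Vector) and "label" (numeric) columns already exist.
  def trainOne(df: DataFrame, partitionCol: String, value: String): LogisticRegressionModel = {
    val partitionDF = df.where(col(partitionCol) === value)
    new LogisticRegression()
      .setFeaturesCol("features")
      .setLabelCol("label")
      .fit(partitionDF)
  }
}
```

Each call to `trainOne` launches its own Spark jobs, so calling it from inside a `Future` per partition value lets several models train at once. Caching the main DataFrame first may help, since every filter otherwise rescans the source.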
