Turo

Reputation: 1607

groupByKey in Spark dataset, execute custom logic during the aggregation

Is it possible to execute custom logic while grouping a Spark dataset? Here is an example that just prints to the console, but I would like to, for example, save the grouped datasets (after applying additional operations) to separate files. In my example, printing "Hey" to the console does not work.

import spark.implicits._ // automatic in spark-shell

case class Student(name: String, grade: String)

val students = sc.parallelize(Seq(
  Student("John", "A"),
  Student("John", "B"),
  Student("Amy", "C")
)).toDF().as[Student]

def someFunc(key: String, values: Iterator[Student]): TraversableOnce[(String, Student)] = {
  println("Hey") // HOW TO GET THIS DONE ?
  values.map(x => (key, x))
}

students.groupByKey(t => t.name).flatMapGroups(someFunc).show()

Upvotes: 0

Views: 385

Answers (1)

user9856051

Reputation: 46

In my example, printing "Hey" to the console does not work.

There is nothing that prevents you from executing arbitrary* code in the closure. However, you cannot expect to see the stdout output there: this code is executed on remote hosts, not on your local machine.

If you want to collect output by some means other than accumulators or task-status updates, use proper logging and a log collector.

* As long as it doesn't use distributed data structures or the SparkContext.

Upvotes: 3
