Turo

Reputation: 1607

groupByKey in Spark dataset, execute custom logic during the aggregation

Is it possible to execute custom logic while grouping a Spark dataset? Here is an example that just prints to the console, but I would like to, for example, save the grouped datasets (after applying additional operations) to separate files. In my example, printing "Hey" to the console does not work.

import spark.implicits._ // automatic in spark-shell

case class Student(name: String, grade: String)

val students = sc.parallelize(Seq(
  Student("John", "A"),
  Student("John", "B"),
  Student("Amy", "C")
)).toDF().as[Student]

def someFunc(key: String, values: Iterator[Student]): TraversableOnce[(String, Student)] = {
  println("Hey") // HOW TO GET THIS DONE ?
  values.map(x => (key, x))
}

students.groupByKey(t => t.name).flatMapGroups(someFunc).show()

Upvotes: 0

Views: 385

Answers (1)

user9856051

Reputation: 46

In my example, printing "Hey" to the console does not work.

There is nothing that prevents you from executing arbitrary* code in the closure. However, you cannot expect to see the stdout output there: this code is executed on remote hosts, not on your local machine.

If you want to collect output by some means other than accumulators or task-status updates, use proper logging and a log collector.

* As long as it doesn't use distributed data structures or the SparkContext.

Upvotes: 3
