Reputation: 1607
Is it possible to execute custom logic while grouping a Spark dataset? The example below just prints to the console, but I would like to e.g. save the grouped datasets (after applying additional operations) to separate files. In my example, printing "Hey" to the console does not work.
import spark.implicits._ // needed for .toDF() and .as[Student]; already in scope in spark-shell

case class Student(name: String, grade: String)

val students = sc.parallelize(Seq(
  Student("John", "A"),
  Student("John", "B"),
  Student("Amy", "C")
)).toDF().as[Student]

def someFunc(key: String, values: Iterator[Student]): TraversableOnce[(String, Student)] = {
  println("Hey") // HOW TO GET THIS DONE?
  values.map(x => (key, x))
}

val groups = students.groupByKey(_.name).flatMapGroups(someFunc)
groups.show()
Upvotes: 0
Views: 385
Reputation: 46
In my example, printing "Hey" to console does not work.
There is nothing that prevents you from executing arbitrary* code in the closure. However, you cannot expect to see its stdout output. Remember that this code is executed on remote hosts, not on your local machine.
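To convince yourself that the closure does run, you can count its invocations with an accumulator, whose value is shipped back to the driver. A minimal sketch against the question's students dataset (the accumulator name is arbitrary):

val processed = sc.longAccumulator("rowsProcessed")

val groups = students
  .groupByKey(_.name)
  .flatMapGroups { (key, values) =>
    values.map { s =>
      processed.add(1) // runs on the executors; value is aggregated on the driver
      (key, s)
    }
  }

groups.show() // the action triggers the job
println(s"Rows processed: ${processed.value}") // printed on the driver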
If you want to collect some output other than accumulators or task metrics, use proper logging and a log collector.
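For example, with Spark's bundled log4j 1.x (a sketch; the logger name is arbitrary, and the messages land in the executor logs, viewable in the Spark UI or through your log aggregation):

import org.apache.log4j.Logger

def someFunc(key: String, values: Iterator[Student]): TraversableOnce[(String, Student)] = {
  // Obtain the logger inside the closure: Logger instances are not serializable
  val log = Logger.getLogger("someFunc")
  log.info(s"Processing group: $key")
  values.map(x => (key, x))
}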
* As long as it doesn't use distributed data structures or Spark contexts.
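If the end goal is one output per group, you don't need custom logic in the closure at all: the writer's partitionBy produces a separate directory per key. A minimal sketch, assuming Parquet output to a path of your choice:

students
  .write
  .partitionBy("name")              // one subdirectory per distinct name
  .mode("overwrite")
  .parquet("/tmp/students-by-name") // hypothetical output path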
Upvotes: 3