Peter Krauss

Reputation: 13982

DF passed as parameter is not working, task not serializable

Using spark-shell to test a function,

def smallDfToCSV(fname: String, df: org.apache.spark.sql.DataFrame): Unit = {
  import java.io._
  val pw = new PrintWriter(new File(fname))
  val header = df.head.schema.map(r => r.name)
  pw.write(header.mkString(",") + "\n")  // fine
  df.limit(5000).foreach(r => pw.write(r.mkString(",") + "\n"))  // error!
  // org.apache.spark.SparkException: Task not serializable
  pw.close
} // \csvWr

val df = spark.sql(query)
smallDfToCSV("./lixo.csv", df)

The error does not make sense to me, because this runs fine:

df.foreach(  r => println(r.mkString(","))  ) 

Upvotes: 0

Views: 131

Answers (1)

Charlie Flowers

Reputation: 1380

The task cannot be serialized because PrintWriter does not implement java.io.Serializable. Any object referenced inside a function that runs on a Spark executor (i.e. inside a map, reduce, foreach, etc. operation on a Dataset or RDD) must be serializable so it can be shipped to the executors.

I'm also curious about the intended goal of your function. Since the foreach body executes on your executors, you would get partial contents of df written to lixo.csv in whatever the current working directory is on each executor. If you intend instead to write the entire contents of df to a single file on your local machine, you must first bring the rows back to the driver via collect.
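As a rough sketch of that approach (not a drop-in fix, just one way to restructure your function): collecting the limited result to the driver means the PrintWriter never has to be serialized, since the write happens entirely in driver code.

def smallDfToCSV(fname: String, df: org.apache.spark.sql.DataFrame): Unit = {
  import java.io._
  val pw = new PrintWriter(new File(fname))
  try {
    val header = df.schema.map(_.name)
    pw.write(header.mkString(",") + "\n")
    // collect() pulls the (limited) rows back to the driver,
    // so nothing in this closure is shipped to executors
    df.limit(5000).collect().foreach(r => pw.write(r.mkString(",") + "\n"))
  } finally {
    pw.close()
  }
}

Alternatively, Spark's built-in CSV writer (df.limit(5000).write.option("header", "true").csv(path)) avoids the driver round-trip entirely, though it produces a directory of part files rather than a single file.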

Upvotes: 2
