Reputation: 13982
I am using spark-shell to test a function:
def smallDfToCSV(fname: String, df: org.apache.spark.sql.DataFrame): Unit = {
  import java.io._
  val pw = new PrintWriter(new File(fname))
  val header = df.head.schema.map(r => r.name)
  pw.write(header.mkString(",") + "\n")                          // fine
  df.limit(5000).foreach(r => pw.write(r.mkString(",") + "\n"))  // error!
  // org.apache.spark.SparkException: Task not serializable
  pw.close
} // \csvWr
val df = spark.sql(query)
smallDfToCSV("./lixo.csv", df)
The error does not make sense to me, because this runs fine:
df.foreach( r => println(r.mkString(",")) )
Upvotes: 0
Views: 131
Reputation: 1380
The task cannot be serialized because PrintWriter does not implement java.io.Serializable. Anything referenced inside a closure that runs on a Spark executor (i.e. inside a map, reduce, foreach, etc. operation on a Dataset or RDD) needs to be serializable so it can be shipped out to the executors.
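For example, a common workaround (just a sketch, and probably not what you want here, for the reason below) is to build the non-serializable object inside the closure, so it is created on each executor instead of being shipped from the driver:

// Rough sketch: the PrintWriter is constructed inside the partition closure,
// so nothing non-serializable is captured from the driver. Each executor would
// then write its own local file (file name here is only illustrative).
df.limit(5000).rdd.foreachPartition { rows =>
  val pw = new java.io.PrintWriter(new java.io.File("./lixo-part.csv"))
  try rows.foreach(r => pw.write(r.mkString(",") + "\n"))
  finally pw.close()
}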
I'm curious about the intended goal of your function, as well. Since the closure you pass to foreach executes on your executors, you would get partial contents of df written to lixo.csv in whatever the current working directory happens to be on each executor. If you instead intend to write the entire contents of df to a file on your local machine, you must first bring the rows back to the driver via collect.
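A minimal sketch of that fix, keeping your function signature (untested; it assumes limit(5000) keeps the collected result small enough to fit comfortably on the driver):

import java.io.{File, PrintWriter}

def smallDfToCSV(fname: String, df: org.apache.spark.sql.DataFrame): Unit = {
  val pw = new PrintWriter(new File(fname))
  try {
    pw.write(df.schema.fieldNames.mkString(",") + "\n")
    // collect() brings the rows back to the driver, so the PrintWriter
    // never has to be serialized or shipped to an executor
    df.limit(5000).collect().foreach(r => pw.write(r.mkString(",") + "\n"))
  } finally pw.close()
}

For anything larger, Spark's built-in CSV writer (df.write.option("header", "true").csv(path)) is usually the better tool, though it produces a directory of part files rather than a single file.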
Upvotes: 2