Reputation: 357
I have a following class that reads csv data into Spark's Dataset
. Everything works fine if I just simply read and return the data
.
However, if I apply a MapFunction
to the data
before returning from function, I get
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: com.Workflow
.
I know Spark's working and its need to serialize objects for distributed processing, however, I'm NOT using any reference to Workflow
class in my mapping logic. I'm not calling any Workflow
class function in my mapping logic. So why is Spark trying to serialize Workflow
class? Any help will be appreciated.
public class Workflow {
private final SparkSession spark;
public Dataset<Row> readData(){
final StructType schema = new StructType()
.add("text", "string", false)
.add("category", "string", false);
Dataset<Row> data = spark.read()
.schema(schema)
.csv(dataPath);
/*
* works fine till here if I call
* return data;
*/
Dataset<Row> cleanedData = data.map(new MapFunction<Row, Row>() {
public Row call(Row row){
/* some mapping logic */
return row;
}
}, RowEncoder.apply(schema));
cleanedData.printSchema();
/* .... ERROR .... */
cleanedData.show();
return cleanedData;
}
}
Upvotes: 0
Views: 698
Reputation: 3256
anonymous inner classes have a hidden/implicit reference to enclosing class. use Lambda expression or go with Roma Anankin's solution
Upvotes: 1
Reputation: 56
you could make Workflow implement Serializeble and SparkSession as @transient
Upvotes: 0