F Baig

Reputation: 357

Java Spark Dataset MapFunction - Task not serializable without any reference to class

I have the following class that reads CSV data into a Spark Dataset. Everything works fine if I simply read and return the data.

However, if I apply a MapFunction to the data before returning it from the function, I get:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable

Caused by: java.io.NotSerializableException: com.Workflow.

I understand how Spark works and why it needs to serialize objects for distributed processing. However, I'm NOT using any reference to the Workflow class in my mapping logic, and I'm not calling any Workflow class method there either. So why is Spark trying to serialize the Workflow class? Any help will be appreciated.

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.StructType;

public class Workflow {

    private final SparkSession spark;
    private final String dataPath;

    public Workflow(SparkSession spark, String dataPath) {
        this.spark = spark;
        this.dataPath = dataPath;
    }

    public Dataset<Row> readData() {
        final StructType schema = new StructType()
            .add("text", "string", false)
            .add("category", "string", false);

        Dataset<Row> data = spark.read()
            .schema(schema)
            .csv(dataPath);

        /*
         * works fine up to here if I just call
         * return data;
         */

        Dataset<Row> cleanedData = data.map(new MapFunction<Row, Row>() {
            @Override
            public Row call(Row row) {
                /* some mapping logic */
                return row;
            }
        }, RowEncoder.apply(schema));

        cleanedData.printSchema();
        /* .... ERROR .... */
        cleanedData.show();

        return cleanedData;
    }
}

Upvotes: 0

Views: 698

Answers (2)

badger

Reputation: 3256

Anonymous inner classes hold a hidden/implicit reference to their enclosing class, so Spark pulls Workflow into the closure when it serializes the MapFunction. Use a lambda expression instead, or go with Roma Anankin's solution.
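
For reference, a minimal sketch of the lambda version of your map call, keeping your schema and RowEncoder as-is (the cast is needed because Dataset#map is overloaded for Java and Scala function types):

    Dataset<Row> cleanedData = data.map(
        (MapFunction<Row, Row>) row -> {
            /* some mapping logic */
            return row;
        },
        RowEncoder.apply(schema));

A lambda only captures the enclosing instance if it actually references instance state, so as long as your mapping logic doesn't touch Workflow's fields or methods, Workflow stays out of the serialized closure.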

Upvotes: 1

Roma Anankin

Reputation: 56

You could make Workflow implement Serializable and mark the SparkSession field as transient.
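
A minimal sketch of that change (in Java the modifier is the transient keyword; @transient is the Scala spelling):

    import java.io.Serializable;
    import org.apache.spark.sql.SparkSession;

    public class Workflow implements Serializable {

        /* transient: the SparkSession stays on the driver and is
           not pulled into the serialized task */
        private transient SparkSession spark;

        /* rest of the class unchanged */
    }

This works here because the mapping logic never uses the SparkSession on the executors; it is only needed on the driver to read the data.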

Upvotes: 0
