F Baig

Reputation: 357

Java Spark Dataset MapFunction - Task not serializable without any reference to class

I have the following class that reads CSV data into a Spark Dataset. Everything works fine if I simply read and return the data.

However, if I apply a MapFunction to the data before returning it from the function, I get:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable

Caused by: java.io.NotSerializableException: com.Workflow.

I understand how Spark works and why it needs to serialize objects for distributed processing. However, I'm NOT using any reference to the Workflow class in my mapping logic, and I'm not calling any Workflow class method there either. So why is Spark trying to serialize the Workflow class? Any help will be appreciated.

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.StructType;

public class Workflow {

    private final SparkSession spark;
    private final String dataPath;

    public Workflow(SparkSession spark, String dataPath) {
        this.spark = spark;
        this.dataPath = dataPath;
    }

    public Dataset<Row> readData() {
        final StructType schema = new StructType()
            .add("text", "string", false)
            .add("category", "string", false);

        Dataset<Row> data = spark.read()
            .schema(schema)
            .csv(dataPath);

        /*
         * works fine up to here if I just call
         * return data;
         */

        Dataset<Row> cleanedData = data.map(new MapFunction<Row, Row>() {
            @Override
            public Row call(Row row) {
                /* some mapping logic */
                return row;
            }
        }, RowEncoder.apply(schema));

        cleanedData.printSchema();
        /* .... ERROR .... */
        cleanedData.show();

        return cleanedData;
    }
}

Upvotes: 0

Views: 698

Answers (2)

badger

Reputation: 3256

Anonymous inner classes hold a hidden/implicit reference to their enclosing class, so Spark pulls Workflow into the closure when it serializes the MapFunction. Use a lambda expression instead, or go with Roma Anankin's solution.
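
For reference, a minimal sketch of the lambda version of your map call, keeping your schema and RowEncoder as-is (the cast is needed because Dataset#map is overloaded for Java and Scala function types):

    Dataset<Row> cleanedData = data.map(
        (MapFunction<Row, Row>) row -> {
            /* some mapping logic */
            return row;
        },
        RowEncoder.apply(schema));

A lambda only captures the enclosing instance if it actually references instance state, so as long as your mapping logic doesn't touch Workflow's fields or methods, Workflow stays out of the serialized closure.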

Upvotes: 1

Roma Anankin

Reputation: 56

You could make Workflow implement Serializable and mark the SparkSession field as transient.
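
A minimal sketch of that change (in Java the modifier is the transient keyword; @transient is the Scala spelling):

    import java.io.Serializable;
    import org.apache.spark.sql.SparkSession;

    public class Workflow implements Serializable {

        /* transient: the SparkSession stays on the driver and is
           not pulled into the serialized task */
        private transient SparkSession spark;

        /* rest of the class unchanged */
    }

This works here because the mapping logic never uses the SparkSession on the executors; it is only needed on the driver to read the data.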

Upvotes: 0
