user1663386

Reputation:

Getting java.io.NotSerializableException while mapping a JavaRDD

The following code leads to a java.io.NotSerializableException when I try to dispatch the job to the executors:

    JavaRDD<Row> rddToWrite = dataToWrite.toJavaRDD();
    JavaRDD<String> stringRdd = rddToWrite.map(new Function<Row, String>() {

        /**
         * Serial Version Id
         */
        private static final long serialVersionUID = 6766320395808127072L;

        @Override
        public String call(Row row) throws Exception {
            return row.mkString(dataFormat.getDelimiter());
        }
    });

However, when I do the following, the task is serialized successfully:

    JavaRDD<Row> rddToWrite = dataToWrite.toJavaRDD();
    List<String> dataList = rddToWrite.collect().stream().parallel()
                               .map(row -> row.mkString(dataFormat.getDelimiter()))
                               .collect(Collectors.<String>toList());
    JavaSparkContext javaSparkContext = new JavaSparkContext(sessionContext.getSparkContext());
    JavaRDD<String> stringRDD = javaSparkContext.parallelize(dataList);

Can anyone please help me point out what I'm doing wrong here?

Edit: dataFormat is a private member field of the class that contains this code. It is an instance of a class DataFormat with two fields: the Spark data format (e.g. "com.databricks.spark.csv") and the delimiter (e.g. "\t").

Upvotes: 1

Views: 1086

Answers (2)

Alexey Romanov

Reputation: 170849

The anonymous class created by new Function ... holds a reference to the enclosing instance, so serializing the function requires serializing the enclosing instance, including dataFormat and every other field. If the enclosing class isn't marked Serializable, or has any non-serializable non-transient field, serialization fails. And even when it succeeds, it silently ships more data to the executors than necessary.
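
To see the capture in isolation, here is a minimal, self-contained sketch (plain JDK serialization, no Spark; the class and field names are made up for illustration) that reproduces the same failure:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    public class CaptureDemo {
        private String delimiter = "\t"; // field of the non-serializable enclosing class

        Serializable makeFunction() {
            // Reading `delimiter` forces the anonymous class to keep a hidden
            // reference (this$0) to the enclosing CaptureDemo instance.
            return new Serializable() {
                private static final long serialVersionUID = 1L;
                @Override
                public String toString() {
                    return delimiter; // really this$0.delimiter
                }
            };
        }

        public static void main(String[] args) throws IOException {
            Serializable f = new CaptureDemo().makeFunction();
            // Fails with java.io.NotSerializableException: CaptureDemo,
            // because serializing f drags the enclosing instance along.
            new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(f);
        }
    }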

Unfortunately, to fully work around this you need to create a named static nested class (or just a separate top-level class), and it can't even be local, because neither anonymous nor local classes in Java can be static:

    static class MyFunction implements Function<Row, String> {
        private static final long serialVersionUID = 6766320395808127072L;

        private final String delimiter;

        MyFunction(String delimiter) {
            this.delimiter = delimiter;
        }

        @Override
        public String call(Row row) throws Exception {
            return row.mkString(delimiter);
        }
    }

And then:

    JavaRDD<String> stringRdd = rddToWrite.map(new MyFunction(dataFormat.getDelimiter()));
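
As an aside, on Java 8+ you can avoid the named class entirely: a lambda whose body touches no instance members does not capture this, so copying the delimiter into a local variable is enough (a sketch reusing the question's variable names):

    String delimiter = dataFormat.getDelimiter(); // local, effectively final
    JavaRDD<String> stringRdd = rddToWrite.map(row -> row.mkString(delimiter));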

Upvotes: 2

Mo Tao

Reputation: 1295

When you access dataFormat inside the function, it really means this.dataFormat, so Spark will try to serialize this (the whole enclosing object) and run into the NotSerializableException.

Try making a local copy first, like this:

    DataFormat dataFormat = this.dataFormat; // local copy; same name, so it shadows the field
    JavaRDD<Row> rddToWrite = dataToWrite.toJavaRDD();
    JavaRDD<String> stringRdd = rddToWrite.map(new Function<Row, String>() ...
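
Spelled out, a minimal sketch of the whole pattern might look like this. Two caveats, both assumptions worth checking against your setup: DataFormat itself must implement Serializable, because the local copy still travels with the closure; and the trick is only fully reliable with a lambda, since javac (before JDK 18) gives a non-static anonymous class a hidden reference to the enclosing instance even when it is unused.

    // Assumes DataFormat implements Serializable; the lambda captures
    // only the local copy, never `this`.
    DataFormat dataFormat = this.dataFormat;
    JavaRDD<Row> rddToWrite = dataToWrite.toJavaRDD();
    JavaRDD<String> stringRdd = rddToWrite.map(row -> row.mkString(dataFormat.getDelimiter()));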

For more information, see http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark

Upvotes: 2
