Reputation: 391
I have a problem with Java 8 and Spark 2.1.1
I have a (valid) regular expression saved in a variable called "pattern". When I try to use this variable to filter the content loaded from a text file, a SparkException is thrown: Task not serializable. Can anyone help me? Here is the code:
JavaRDD<String> lines = sc.textFile(path);
JavaRDD<String> filtered = lines.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String v1) throws Exception {
        return v1.contains(pattern);
    }
});
And here is the error stack trace:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2101)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:387)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:386)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.filter(RDD.scala:386)
at org.apache.spark.api.java.JavaRDD.filter(JavaRDD.scala:78)
at FileReader.filteredRDD(FileReader.java:47)
at FileReader.main(FileReader.java:68)
Caused by: java.io.NotSerializableException: FileReader
Serialization stack:
- object not serializable (class: FileReader, value: FileReader@6107165)
- field (class: FileReader$1, name: this$0, type: class FileReader)
- object (class FileReader$1, FileReader$1@7c447c76)
- field (class: org.apache.spark.api.java.JavaRDD$$anonfun$filter$1, name: f$1, type: interface org.apache.spark.api.java.function.Function)
- object (class org.apache.spark.api.java.JavaRDD$$anonfun$filter$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
Upvotes: 2
Views: 1214
Reputation: 6994
The serialization debug report generated by Spark,

- object not serializable (class: FileReader, value: FileReader@6107165)
- field (class: FileReader$1, name: this$0, type: class FileReader)
- object (class FileReader$1, FileReader$1@7c447c76)

shows that FileReader, the class enclosing the closure, is not serializable. This happens because Spark cannot serialize just the method: methods cannot be serialized on their own, so Spark tries to serialize the whole enclosing object. The anonymous inner class FileReader$1 holds a hidden this$0 reference to the enclosing FileReader instance, which pulls the entire non-serializable class into the serialization attempt.
In your code, pattern is presumably a class variable (a field of FileReader), and that is what causes the problem: Spark cannot serialize pattern without serializing the whole enclosing class. Try copying pattern into a local variable and referencing that local variable inside the closure; that will work.
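The mechanism can be demonstrated with plain JDK serialization, no Spark needed. This is a minimal sketch with made-up names (ClosureDemo, Driver, MyFunc): Driver stands in for the non-serializable FileReader, and MyFunc stands in for Spark's serializable Function interface. One closure reads the outer field directly, the other reads a local copy via a Java 8 lambda, which captures only the values it actually references:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class ClosureDemo {

    // Serializable functional interface, standing in for Spark's Function.
    public interface MyFunc extends Serializable {
        Boolean call(String v1);
    }

    // Non-serializable "driver" class, like FileReader in the question.
    public static class Driver {
        public String pattern = "ERROR";

        // Anonymous inner class reading the outer field: javac stores a hidden
        // this$0 reference to the Driver instance, so serializing the closure
        // drags the whole non-serializable Driver along and fails.
        public MyFunc innerClassClosure() {
            return new MyFunc() {
                @Override
                public Boolean call(String v1) {
                    return v1.contains(pattern); // implicitly this$0.pattern
                }
            };
        }

        // The fix: copy the field into a final local variable. A lambda captures
        // only what it references, here a plain String, so it serializes fine.
        public MyFunc lambdaClosure() {
            final String localPattern = pattern;
            return v1 -> v1.contains(localPattern);
        }
    }

    // Returns true if obj survives Java serialization (the same check Spark's
    // ClosureCleaner effectively performs before shipping a task).
    public static boolean isSerializable(Object obj) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(obj);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Driver d = new Driver();
        // prints: false (NotSerializableException on the captured Driver)
        System.out.println(isSerializable(d.innerClassClosure()));
        // prints: true (only the String is captured)
        System.out.println(isSerializable(d.lambdaClosure()));
    }
}
```

Applied to the question's code, that means assigning pattern to a final local variable before the filter call and using that variable inside the closure.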
Upvotes: 2