Reputation: 43
I've run into something odd in how Spark 2.2 deserializes case classes. For these examples, assume this case class:
case class X(a: Int, b: Int) {
  println("in the constructor!!!")
}
If I have the following map operation, I see both the constructor message and the value of 'a' in the executor logs.
ds.map { _ =>
  val x = X(1, 2)
  println(s"a=${x.a}")
}
With the following map operation, I do not see my constructor message, but I do see the value of 'a' in the executor logs. The constructor message appears in the driver logs.
val x = X(1, 2)
ds.map(_ => println(s"a=${x.a}"))
And I get the same behavior if I use a broadcast variable.
val xBcast = sc.broadcast(X(1, 2))
ds.map(_ => println(s"a=${xBcast.value.a}"))
Any idea what's going on? Is Spark serializing each field as needed? I would have expected the whole object to be shipped over and deserialized. With that deserialization I'd expect a constructor call.
When I looked at the encoder code for Products, it appears to pull the necessary fields out of the constructor parameters. I had assumed Spark would use those encoders for this kind of thing.
I even decompiled my case class's class file and the constructor generated seems reasonable.
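For reference, here's a quick way to peek at the encoder Spark derives for the case class (a sketch, assuming a plain Spark 2.2 shell where X is defined). Notably, deriving the encoder alone prints no constructor message:

import org.apache.spark.sql.Encoders

// Derive the encoder Spark would use for a Dataset[X].
val enc = Encoders.product[X]
println(enc.schema) // StructType(StructField(a,IntegerType,false), StructField(b,IntegerType,false))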
Upvotes: 4
Views: 2143
Reputation: 20561
By default Spark uses Java serialization (available because case classes extend Serializable), which does not require a constructor call to deserialize. See this StackOverflow question for details on Java serialization/deserialization.
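Here's a minimal plain-JVM sketch (no Spark involved; ConstructorSkipDemo is just an illustrative name) showing that readObject rebuilds the object's fields without ever running the case class's constructor:

import java.io._

case class X(a: Int, b: Int) {
  println("in the constructor!!!")
}

object ConstructorSkipDemo extends App {
  val bytes = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bytes)
  oos.writeObject(X(1, 2)) // prints "in the constructor!!!" once, when X(1, 2) is built
  oos.close()

  val ois = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  val copy = ois.readObject().asInstanceOf[X] // no constructor message here
  println(s"a=${copy.a}") // fields were written straight into the new instance
}

Run on its own, that prints the constructor message exactly once, which matches your logs: the driver constructs (and serializes) the object, and the executors deserialize its fields without a constructor call.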
Note that this reliance on Java serialization can cause issues: the internal serialization format is not set in stone, so JVM version differences can break deserialization.
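If that fragility is a concern, one common mitigation (my suggestion here, not something your example requires) is to switch data and broadcast serialization to Kryo; note that closures themselves are still serialized with Java serialization regardless:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes avoids Kryo writing full class names with every record.
  .registerKryoClasses(Array(classOf[X]))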
Upvotes: 1