Mike Hurley

Reputation: 43

How does Spark serialization work for case classes?

I've run into something odd with how Spark 2.2 deserializes case classes. For these examples, assume this case class:

case class X(a: Int, b: Int) {
  println("in the constructor!!!")
}

If I have the following map operation, I see both the constructor message and the value of 'a' in the executor logs.

ds.map(_ => {
  val x = X(1, 2)
  println(s"a=${x.a}")
})

With the following map operation, I do not see the constructor message in the executor logs, but I do see the value of 'a'. The constructor message shows up in the driver logs.

val x = X(1, 2)
ds.map(_ => println(s"a=${x.a}"))

And I get the same behavior if I use a broadcast variable.

val xBcast = sc.broadcast(X(1, 2))
ds.map(_ => println(s"a=${xBcast.value.a}"))

Any idea what's going on? Is Spark serializing each field as needed? I would have expected the whole object to be shipped over and deserialized. With that deserialization I'd expect a constructor call.

When I looked at the encoder code for Products, it looks like it derives the fields to serialize from the constructor parameters. I had assumed Spark would use those encoders for this kind of thing.
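
To illustrate the path I mean, here's a rough sketch (it assumes a SparkSession named spark is in scope). Mapping over a Dataset[X] should force each element through the Product encoder's deserializer, and from my reading of the encoder code I'd expect that to call the constructor on the executors:

import spark.implicits._

// The encoder serializes each X into Spark's internal row format;
// mapping over the Dataset deserializes each one back, which, as far
// as I can tell, rebuilds the element via the case class constructor.
val xs = spark.createDataset(Seq(X(1, 2), X(3, 4)))
xs.map(x => x.a).collect()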

I even decompiled my case class's class file, and the generated constructor looks reasonable.

Upvotes: 4

Views: 2143

Answers (1)

Levi Ramsey

Reputation: 20561

By default Spark uses Java serialization (available because case classes extend Serializable), which does not invoke the class's constructor when deserializing. See this StackOverflow question for details on Java serialization/deserialization.
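
You can see this without Spark at all. Here's a minimal sketch in plain Scala (using the X from the question; paste into a REPL):

import java.io._

case class X(a: Int, b: Int) {
  println("in the constructor!!!")
}

val baos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(baos)
oos.writeObject(X(1, 2)) // the constructor runs here, printing the message
oos.close()

val ois = new ObjectInputStream(new ByteArrayInputStream(baos.toByteArray))
val copy = ois.readObject().asInstanceOf[X]
println(s"a=${copy.a}") // prints a=1, with no constructor message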

Note that this reliance on Java serialization can cause issues: the internal serialization format is not set in stone, so JVM version differences can cause deserialization to fail.
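
If that's a concern for your deployment, here's a sketch of switching Spark's data serializer to Kryo. Note that, as far as I know, Spark still serializes the closures themselves with Java serialization; this setting applies to shuffled and broadcast data:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes is optional but keeps the serialized form compact.
  .registerKryoClasses(Array(classOf[X]))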

Upvotes: 1
