gfytd

Reputation: 1809

Spark: when do we need to enable KryoSerializer?

I have a Spark (version 2.4.7) job:

JavaRDD<Row> rows = javaSparkContext.newAPIHadoopFile(...)
                    .map(d -> {
                      // Deserialize the protobuf payload from the Hadoop value
                      Foo foo = Foo.parseFrom(d._2.copyBytes());
                      String v1 = foo.getField1();
                      int v2 = foo.getField2();
                      double[] v3 = foo.getField3();
                      // Only primitive/String/array values leave the lambda
                      return RowFactory.create(v1, v2, v3);
                    });
spark.createDataFrame(rows, schema).createOrReplaceTempView("table1");
spark.sql("...sum/avg/group-by...").show();

Class Foo here is a complex Google protobuf class.

I have several questions:

  1. Will changing 'spark.serializer' to Kryo have any impact on the Foo objects in this case?
  2. In this case, where every DataFrame column is a primitive, a String, or an array of primitives/Strings, is it worth switching 'spark.serializer' to Kryo from a performance perspective?
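For context, the change in question 1 would look roughly like the sketch below (configuration only; the `registerKryoClasses` call for `Foo` is optional, but registering classes keeps Kryo from writing full class names into the serialized stream). Whether it matters here presumably depends on whether `Foo` instances are ever serialized at all, since in the job above they only exist inside the map lambda:

```java
import org.apache.spark.SparkConf;

// Minimal sketch of enabling Kryo for this job; "MyJob" is a placeholder name.
SparkConf conf = new SparkConf()
        .setAppName("MyJob")
        // Switch the serializer used for shuffled/cached data
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        // Optionally register classes Kryo will see, e.g. the protobuf class
        .registerKryoClasses(new Class<?>[]{ Foo.class });
```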

Many thanks.

Upvotes: 2

Views: 187

Answers (0)
