gfytd

Reputation: 1809

Spark: when do we need to enable KryoSerializer?

I have a Spark (version 2.4.7) job:

JavaRDD<Row> rows = javaSparkContext.newAPIHadoopFile(...)
                    .map(d -> {
                      // Deserialize the protobuf payload from the Hadoop value
                      Foo foo = Foo.parseFrom(d._2.copyBytes());
                      String v1 = foo.getField1();
                      int v2 = foo.getField2();
                      double[] v3 = foo.getField3();
                      // Only primitive/String/array values leave the lambda
                      return RowFactory.create(v1, v2, v3);
                    });
spark.createDataFrame(rows, schema).createOrReplaceTempView("table1");
spark.sql("...sum/avg/group-by...").show();

Class Foo here is a complex Google protobuf class.

I have several questions:

  1. Will changing 'spark.serializer' to Kryo have any impact on the Foo objects in this case?
  2. In this case, where every DataFrame column is a primitive, a String, or an array of primitives/Strings, is it worth switching 'spark.serializer' to Kryo from a performance perspective?
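For context, the change in question 1 would look roughly like the sketch below (configuration only; the `registerKryoClasses` call for `Foo` is optional, but registering classes keeps Kryo from writing full class names into the serialized stream). Whether it matters here presumably depends on whether `Foo` instances are ever serialized at all, since in the job above they only exist inside the map lambda:

```java
import org.apache.spark.SparkConf;

// Minimal sketch of enabling Kryo for this job; "MyJob" is a placeholder name.
SparkConf conf = new SparkConf()
        .setAppName("MyJob")
        // Switch the serializer used for shuffled/cached data
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        // Optionally register classes Kryo will see, e.g. the protobuf class
        .registerKryoClasses(new Class<?>[]{ Foo.class });
```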

Many thanks.

Upvotes: 2

Views: 187

Answers (0)
