lucy

Reputation: 4506

Performance Optimization

I have 6 tables in Hive. I am joining these tables with incoming Kafka stream data using Spark Streaming. I registered all 6 tables, as well as the incoming Kafka data, with the registerTempTable function, and then applied an inner join across all the tables.

example -

select * from tableA a 
join tableB b on a.id = b.id     
join tableC c on b.id = c.id
......
......

It takes about 3 minutes to complete the join, and I can see a lot of data shuffling.

I have used below properties -

  conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  conf.set("spark.kryoserializer.buffer.max", "512")
  conf.set("spark.sql.broadcastTimeout", "36000")
  conf.set("spark.sql.autoBroadcastJoinThreshold", "94371840")

Is there any way to reduce the shuffle read and write?

Upvotes: 1

Views: 179

Answers (1)

Vladislav Varslavans

Reputation: 2934

You need to:

  1. Convert each DataFrame to a key/value PairRDD
  2. Partition all the PairRDDs with the same partitioner
  3. cache() the intermediate results
  4. Then use the RDDs in the join operation (you will need to convert the Kafka data to a PairRDD as well)

This way the first join will be slow, but subsequent joins will be faster, because the re-partitioning of the data happens only once.
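A minimal Scala sketch of those four steps. Note this is an illustration under assumptions, not tested code: it assumes Spark 2.x (`spark.table`), a numeric `id` join column as in your SQL, and a hypothetical `kafkaStream` DStream whose records expose an `id` field — adjust the names and key extraction to your schema.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.Row

val partitioner = new HashPartitioner(200)  // pick a partition count for your cluster

// 1. Convert each Hive-backed DataFrame to a key/value PairRDD keyed on the join column.
val rddA = spark.table("tableA").rdd.map(row => (row.getAs[Long]("id"), row))
val rddB = spark.table("tableB").rdd.map(row => (row.getAs[Long]("id"), row))

// 2. + 3. Partition the static tables with the SAME partitioner and cache them,
// so their shuffle happens once, up front, instead of on every micro-batch.
val partA = rddA.partitionBy(partitioner).cache()
val partB = rddB.partitionBy(partitioner).cache()
partA.count(); partB.count()  // force materialization before streaming starts

// 4. Per Kafka micro-batch: key the stream the same way, partition it with the
// same partitioner, then join; the cached sides are not re-shuffled.
kafkaStream.foreachRDD { batch =>
  val keyed  = batch.map(rec => (rec.id, rec)).partitionBy(partitioner)
  val joined = keyed.join(partA).join(partB)
  // ... process joined ...
}
```

The key point is that `partA` and `partB` share a known partitioner, so Spark can perform the joins without re-shuffling them; only the small per-batch Kafka RDD is shuffled each interval.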

There are some good hints on joins in Spark here

Upvotes: 2

Related Questions