lucy

Reputation: 4506

Performance Optimization

I have 6 tables in Hive. I am joining these tables with incoming Kafka stream data using Spark Streaming. I registered all 6 tables, as well as the incoming Kafka data, with the registerTempTable function, and then applied an inner join across all the tables.

example -

select * from tableA a 
join tableB b on a.id = b.id     
join tableC c on b.id = c.id
......
......

It takes about 3 minutes to complete the join, and I can see a lot of data shuffling.

I have used below properties -

  conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  conf.set("spark.kryoserializer.buffer.max", "512")
  conf.set("spark.sql.broadcastTimeout", "36000")
  conf.set("spark.sql.autoBroadcastJoinThreshold", "94371840")

Is there any way to reduce the shuffle read and write?

Upvotes: 1

Views: 179

Answers (1)

Vladislav Varslavans

Reputation: 2934

You need to:

  1. Convert each DataFrame to a key/value PairRDD
  2. Partition all the PairRDDs with the same partitioner
  3. cache() the intermediate results
  4. Then use the RDDs in the join operation (you will need to convert the Kafka data to a PairRDD as well)

This way the first join will be slow, but subsequent joins will be faster, because the re-partitioning of the data happens only once.
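A minimal Scala sketch of those four steps. Note this is an illustration under assumptions, not tested code: it assumes Spark 2.x (`spark.table`), a numeric `id` join column as in your SQL, and a hypothetical `kafkaStream` DStream whose records expose an `id` field — adjust the names and key extraction to your schema.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.Row

val partitioner = new HashPartitioner(200)  // pick a partition count for your cluster

// 1. Convert each Hive-backed DataFrame to a key/value PairRDD keyed on the join column.
val rddA = spark.table("tableA").rdd.map(row => (row.getAs[Long]("id"), row))
val rddB = spark.table("tableB").rdd.map(row => (row.getAs[Long]("id"), row))

// 2. + 3. Partition the static tables with the SAME partitioner and cache them,
// so their shuffle happens once, up front, instead of on every micro-batch.
val partA = rddA.partitionBy(partitioner).cache()
val partB = rddB.partitionBy(partitioner).cache()
partA.count(); partB.count()  // force materialization before streaming starts

// 4. Per Kafka micro-batch: key the stream the same way, partition it with the
// same partitioner, then join; the cached sides are not re-shuffled.
kafkaStream.foreachRDD { batch =>
  val keyed  = batch.map(rec => (rec.id, rec)).partitionBy(partitioner)
  val joined = keyed.join(partA).join(partB)
  // ... process joined ...
}
```

The key point is that `partA` and `partB` share a known partitioner, so Spark can perform the joins without re-shuffling them; only the small per-batch Kafka RDD is shuffled each interval.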

There are some good hints on joins in Spark here

Upvotes: 2

Related Questions