Reputation: 4506
I have 6 tables in Hive. I am joining these tables with incoming Kafka stream data using Spark Streaming. I used the registerTempTable function to register all 6 tables as well as the incoming Kafka data, and then applied an inner join across all of them.
Example:
select * from tableA a
join tableB b on a.id = b.id
join tableC c on b.id = c.id
......
......
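For reference, here is a minimal sketch of this setup, assuming Spark 1.6-style APIs (SQLContext, registerTempTable), that the six Hive tables are already registered as temp tables, and a hypothetical Kafka DStream named kafkaStream of (id, payload) string pairs; any name not in the question is illustrative only.

import org.apache.spark.sql.SQLContext

case class Event(id: String, payload: String)  // hypothetical schema for the Kafka records

kafkaStream.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._

  // Register the current micro-batch as a temp table next to tableA..tableF
  rdd.map { case (id, payload) => Event(id, payload) }
     .toDF()
     .registerTempTable("kafkaData")

  // Same inner join as above, now including the streaming data
  val joined = sqlContext.sql(
    """select * from kafkaData k
      |join tableA a on k.id = a.id
      |join tableB b on a.id = b.id
      |join tableC c on b.id = c.id""".stripMargin)

  joined.count()  // force the join for this batch
}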
It took about 3 minutes to complete the join, and I can see a lot of data shuffling.
I have used the following properties:
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryoserializer.buffer.max", "512")
conf.set("spark.sql.broadcastTimeout", "36000")
conf.set("spark.sql.autoBroadcastJoinThreshold", "94371840")
Is there any way to reduce the shuffle read and write?
Upvotes: 1
Views: 179
Reputation: 2934
You need to cache() the intermediate result. This way the first join will be slow, but subsequent joins will be faster, because the re-partitioning of the data happens only once.
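A minimal sketch of that idea, assuming the six Hive tables are already available as DataFrames (tableA, tableB, tableC, ...) and each Kafka micro-batch arrives as a DataFrame batchDF; names not in the original post are illustrative.

// Join the static Hive tables once and cache the intermediate result,
// so their shuffle/re-partitioning happens only on first use.
val staticJoined = tableA
  .join(tableB, Seq("id"))
  .join(tableC, Seq("id"))
  .cache()

// Each micro-batch then only joins the (much smaller) streaming data
// against the cached intermediate result.
val result = batchDF.join(staticJoined, Seq("id"))
result.count()  // the first batch materializes the cache; later batches reuse it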
There are some good hints on joins in Spark here.
Upvotes: 2