Reputation: 783
I am new to Spark-SQL, which I use to read Hive tables. I want to know how Spark performs a multi-table join. I read somewhere that it is recommended to always keep the largest table at the top of the join order, and so on, as this is conducive to join efficiency. I also read that in a join, Spark loads the first table in the order (the largest) into memory and streams the other table, which aids join performance. However, I am confused as to how this strategy boosts performance, since the largest table will in most cases not fit in memory and will spill to disk.
Can anyone please clarify and explain the joining mechanism employed by Spark under the hood when joining [large vs. medium], [large vs. small] and [large vs. large] tables, in terms of join types (inner & outer) and join performance? I want to know the best practices to follow for join table ordering to achieve optimal performance with all the join strategies (SMJ, ShuffleHash & Broadcast) employed by Spark. Let's assume the following query:
select
a.id,
b.cust_nm,
c.prod_nm
from large_tab a
join medium_tab b
on a.id = b.id
join small_tab c
on a.pid = c.pid;
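For illustration, here is a sketch of the same query in the DataFrame API with an explicit broadcast hint on the small table (it assumes the three tables are visible through the Hive metastore); this is the kind of broadcast strategy I am asking about:
from pyspark.sql.functions import broadcast

# Sketch only: large_tab, medium_tab and small_tab are the tables from the query above.
a = spark.table("large_tab")
b = spark.table("medium_tab")
c = spark.table("small_tab")

result = (
    a.join(b, a["id"] == b["id"])                # large vs. medium
     .join(broadcast(c), a["pid"] == c["pid"])   # small side hinted for broadcast
     .select(a["id"], b["cust_nm"], c["prod_nm"])
)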
Note: We use Spark 2.4
Any help is much appreciated. Thanks.
Upvotes: 6
Views: 6016
Reputation: 3344
Regarding the order of the joins, Spark provides the functionality to find the optimal configuration (order) of the tables in the join, but it depends on some configuration settings (the code below uses the PySpark API):
spark.conf.set("spark.sql.cbo.enabled", True)
spark.conf.set("spark.sql.cbo.joinReorder.enabled", True)
spark.sql("ANALYZE TABLE table_name COMPUTE STATISTICS FRO COLUMNS col1, col2, ...")
Computing the statistics is quite important here, because based on them Spark will estimate the size of the tables in the join and reorder them accordingly. For even better estimates, you can also enable histogram computation for the columns (this is also off by default in 2.4):
spark.conf.set("spark.sql.statistics.histogram.enabled", True)
The maximum number of tables for which this joinReorder can be used is controlled by the setting spark.sql.cbo.joinReorder.dp.threshold, and the default value is 12.
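It is an ordinary SQL configuration, so it can be adjusted the same way as the settings above, for example:
spark.conf.set("spark.sql.cbo.joinReorder.dp.threshold", 12)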
Upvotes: 6