grv
grv

Reputation: 1153

Where clause versus join clause in Spark SQL

I am writing a query to get records from Table A which satisfies a condition from records in Table B. For example:

Table A is:

Name           Profession        City 
John           Engineer          Palo Alto
Jack           Doctor            SF

Table B is:

Profession     City      NewJobOffer
Engineer       SF           Yes

and I'm interested to get Table c:

Name   Profession  City NewJobOffer
Jack   Engineer    SF      Yes 

I can do this in two ways using where clause or join query which one is faster and why in spark sql? Where clause to compare the columns add select those records or join on the column itself, which is better?

Upvotes: 4

Views: 3829

Answers (1)

Hasan Ammori
Hasan Ammori

Reputation: 403

It's better to provide filter in WHERE clause. These two expressions are not equivalent.

When you provide filtering in JOIN clause, you will have two data sources retrieved and then joined on specified condition. Since join is done through shuffling (redistributing between executors) data first, you are going to shuffle a lot of data.

When you provide filter in WHERE clause, Spark can recognize it and you will have two data sources filtered and then joined. This way you will shuffle less amount of data. What might be even more important is that this way Spark may also be able to do a filter-pushdown, filtering data at datasource level, which means even less network pressure.

Upvotes: 3

Related Questions