Reputation: 902
What is the difference between the following two methods of joining two PySpark DataFrames?
1. Calling createOrReplaceTempView() on both DataFrames and joining them with sparkSession.sql().
2. Calling alias() on both DataFrames and then joining them with the join() method.
A minimal sketch of both approaches follows the list.
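For concreteness, here is a minimal sketch of the two approaches (the df1/df2 DataFrames and the id column are just placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    df1 = spark.createDataFrame([(1, "x")], ["id", "val1"])
    df2 = spark.createDataFrame([(1, "y")], ["id", "val2"])

    # 1. Register temp views and join through the SQL API
    df1.createOrReplaceTempView("t1")
    df2.createOrReplaceTempView("t2")
    sql_join = spark.sql(
        "SELECT t1.id, t1.val1, t2.val2 FROM t1 JOIN t2 ON t1.id = t2.id"
    )

    # 2. Alias the DataFrames and join through the DataFrame API
    df_join = (
        df1.alias("a")
        .join(df2.alias("b"), col("a.id") == col("b.id"))
        .select("a.id", "a.val1", "b.val2")
    )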
Upvotes: 2
Views: 261
Reputation: 1912
No difference, unless you give hints or apply optimizations in your SQL or DataFrame API code. You can write join operations using either the DataFrame or the SQL API; the operations go through the same Catalyst optimizer and are converted to the same execution plan.
The physical plan, often called a Spark plan, specifies how the logical plan will execute on the cluster by generating different physical execution strategies and comparing them through a cost model.
Physical planning results in a series of RDDs and transformations. This is why you might have heard Spark referred to as a compiler: it takes queries expressed in DataFrames, Datasets, and SQL and compiles them into RDD transformations for you.
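You can verify this yourself by comparing the output of explain() for both variants. Assuming the df1/df2 DataFrames and t1/t2 temp views sketched in the question, both calls should print the same physical plan:

    from pyspark.sql.functions import col

    # Assumes the t1/t2 views and df1/df2 DataFrames from the question.
    # Both queries pass through the same Catalyst optimizer, so both
    # print the same physical (Spark) plan.
    spark.sql(
        "SELECT t1.id, t1.val1, t2.val2 FROM t1 JOIN t2 ON t1.id = t2.id"
    ).explain()

    df1.alias("a").join(
        df2.alias("b"), col("a.id") == col("b.id")
    ).select("a.id", "a.val1", "b.val2").explain()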
Upvotes: 2