Rishabh

Reputation: 902

What is the difference between the two methods in joining two Pyspark dataframes

What is the difference between the following two methods of joining two PySpark dataframes?
1. Using createOrReplaceTempView() on both dataframes and then sparkSession.sql()
2. Using dataframe.alias() on both dataframes and then the join() method

Upvotes: 2

Views: 261

Answers (1)

Lakshman Battini

Reputation: 1912

No difference, unless you add hints or other optimizations to your SQL or DataFrame API code. You can express a join with either the DataFrame API or SQL; both go through the same Catalyst optimizer and are converted to the same execution plan.


The physical plan, often called a Spark plan, specifies how the logical plan will execute on the cluster by generating different physical execution strategies and comparing them through a cost model.

Physical planning results in a series of RDDs and transformations. This is why you might have heard Spark referred to as a compiler - it takes queries in DataFrames, Datasets, and SQL and compiles them into RDD transformations for you.

Upvotes: 2
