Nav

Reputation: 146

pySpark: how does a TempView table get joined to a Hive table?

I have a DataFrame which is registered as a tempView, and a Hive table to join it to:

    # Register the DataFrame as a temporary view so it can be referenced in SQL
    df1.createOrReplaceTempView("mydata")

    # Join the temp view to the Hive table, filtering the Hive side by date
    df2 = spark.sql("select md.column1, md.column2, mht.column1 \
                     from mydata md inner join myHivetable mht on mht.key1 = md.key1 \
                     where mht.transdate between '2017-08-01' and '2017-08-10' ")

How does this join happen? Will Spark try to read the Hive table into memory, or will it decide to write the tempView table into Hive if the data volume in the Hive table is very high?
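For reference, the physical plan shows what Spark actually decides; this is just a diagnostic sketch assuming the df2 defined above:

    # Print the plan Spark chose for the join (diagnostic only).
    # "BroadcastHashJoin" means the small side is shipped to every executor;
    # "SortMergeJoin" means both sides are shuffled on the join key.
    df2.explain(True)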

Adding the following details after the first answer:

Let's say we have

A 100-row tempView in Spark called TABLE_A.

A 1-billion-row table in Hive called TABLE_B.

Next, we need to join TABLE_A with TABLE_B.

There is a date range condition on TABLE_B.

Since TABLE_B is big, will Spark read the whole of TABLE_B into memory, will it write TABLE_A to temp space in Hadoop to do a Hive join, or is it intelligent enough to figure out the best way to do the join for performance?
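If we do not want to rely on Spark's own choice, a broadcast hint is one option. The sketch below reuses the names from the question (mydata, myHivetable, key1, transdate) and assumes a Hive-enabled SparkSession; it is an illustration, not the only way to write the join.

    from pyspark.sql.functions import broadcast

    # Sketch: filter the big Hive table first, then broadcast the small
    # temp view so only its 100 rows are shipped to the executors instead
    # of shuffling the 1-billion-row table.
    table_a = spark.table("mydata")            # the registered temp view
    table_b = spark.table("myHivetable") \
                   .where("transdate between '2017-08-01' and '2017-08-10'")
    joined = table_b.join(broadcast(table_a), "key1")

Spark will normally pick a broadcast join on its own when the small side is below spark.sql.autoBroadcastJoinThreshold (10 MB by default).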

Upvotes: 1

Views: 1659

Answers (1)

karthikr

Reputation: 99660

Registering a temp table/view only records its metadata in the session catalog of the Hive context; no data is moved or written out. This allows SQL-like query operations to be performed on the data, and we still get the same performance as we would otherwise.
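A quick way to see this (a minimal sketch, assuming the mydata view from the question):

    # Registering a temp view only adds an entry to the session catalog;
    # no data is copied into Hive or anywhere else.
    spark.catalog.listTables()     # "mydata" is listed as a TEMPORARY table
    # The plan over the view is the same as the plan over df1 itself:
    spark.sql("select * from mydata").explain()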

Some more information on this can be read here and here

Upvotes: 1
