Reputation: 295
I am new to Spark and Hive. I need to understand what happens behind the scenes when a Hive table is queried in Spark. I am using PySpark.
Ex:
from pyspark.sql import SparkSession

warehouse_location = '/user/hive/warehouse'
spark = (SparkSession.builder
         .appName("Pyspark")
         .config("spark.sql.warehouse.dir", warehouse_location)
         .enableHiveSupport()
         .getOrCreate())
DF = spark.sql("select * from hive_table")
In the above case, does the actual SQL run in the Spark framework, or does it run in Hive's MapReduce framework?
I am just wondering how the SQL is being processed: in Hive or in Spark?
Upvotes: 1
Views: 2571
Reputation: 35249
enableHiveSupport() and HiveContext are quite misleading names, as they suggest some deeper relationship with Hive.
In practice, Hive support means that Spark will use the Hive metastore to read and write table metadata. Before 2.0 there were some additional benefits (window function support, a better parser), but this is no longer the case today.
Hive support does not imply that the query is executed by Hive's engine (MapReduce or Tez): Spark parses, optimizes, and executes the SQL itself, using the metastore only to resolve table definitions.
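You can see this split for yourself. Below is a minimal sketch (assuming the hive_table name from the question): it lists tables through the metastore, then prints the query plan, which contains Spark operators rather than MapReduce stages.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Pyspark")
         .enableHiveSupport()
         .getOrCreate())

# Table metadata is read from the Hive metastore
print(spark.catalog.listTables())

# The physical plan shows Spark operators (e.g. scans and exchanges),
# not MapReduce jobs
spark.sql("select * from hive_table").explain(True)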
Upvotes: 5
Reputation: 4754
SparkSQL allows reading and writing data to Hive tables. In addition to Hive data, any RDD can be converted to a DataFrame, and SparkSQL can be used to run queries on the DataFrame.
The actual execution will happen on Spark. You can check this in your example by running a DF.count() and tracking the job via the Spark UI at http://localhost:4040.
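For instance (a minimal sketch reusing DF from the question; port 4040 is the default and shifts to 4041 and so on if it is already taken):

# count() is an action, so it launches a Spark job, with no MapReduce involved
DF = spark.sql("select * from hive_table")
print(DF.count())

# While the job runs, it appears under the "Jobs" tab of the Spark UI
# at http://localhost:4040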
Upvotes: 1