Reputation: 295
I am new to Spark and Hive. I need to understand what happens behind the scenes when a Hive table is queried in Spark. I am using PySpark.
Ex:
from pyspark.sql import SparkSession

warehouse_location = '/user/hive/warehouse'
spark = (SparkSession.builder
         .appName("Pyspark")
         .config("spark.sql.warehouse.dir", warehouse_location)
         .enableHiveSupport()
         .getOrCreate())
DF = spark.sql("select * from hive_table")
In the above case, does the actual SQL run in the Spark framework, or does it run in Hive's MapReduce framework?
I am just wondering how the SQL is being processed: in Hive or in Spark?
Upvotes: 1
Views: 2571
Reputation: 35249
enableHiveSupport() and HiveContext are quite misleading names, as they suggest some deeper relationship with Hive.
In practice, Hive support means that Spark will use the Hive metastore to read and write table metadata. Before 2.0 there were some additional benefits (window function support, a better parser), but this is no longer the case today.
Hive support does not imply that the query is executed by Hive's engine (MapReduce or Tez): Spark parses, optimizes, and executes the SQL itself, using the metastore only to resolve table definitions.
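You can see this split for yourself. Below is a minimal sketch (assuming the hive_table name from the question): it lists tables through the metastore, then prints the query plan, which contains Spark operators rather than MapReduce stages.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Pyspark")
         .enableHiveSupport()
         .getOrCreate())

# Table metadata is read from the Hive metastore
print(spark.catalog.listTables())

# The physical plan shows Spark operators (e.g. scans and exchanges),
# not MapReduce jobs
spark.sql("select * from hive_table").explain(True)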
Upvotes: 5
Reputation: 4754
SparkSQL allows reading and writing data to Hive tables. In addition to Hive data, any RDD can be converted to a DataFrame, and SparkSQL can be used to run queries on the DataFrame.
The actual execution will happen on Spark. You can check this in your example by running a DF.count() and tracking the job via the Spark UI at http://localhost:4040.
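For instance (a minimal sketch reusing DF from the question; port 4040 is the default and shifts to 4041 and so on if it is already taken):

# count() is an action, so it launches a Spark job, with no MapReduce involved
DF = spark.sql("select * from hive_table")
print(DF.count())

# While the job runs, it appears under the "Jobs" tab of the Spark UI
# at http://localhost:4040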
Upvotes: 1