Harish

Reputation: 295

Understanding how Hive SQL gets executed in Spark

I am new to Spark and Hive. I need to understand what happens behind the scenes when a Hive table is queried in Spark. I am using PySpark.

Ex:

warehouse_location = '/user/hive/warehouse'  # forward slashes; '\u...' is an invalid escape in Python

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Pyspark")
         .config("spark.sql.warehouse.dir", warehouse_location)
         .enableHiveSupport()
         .getOrCreate())

DF = spark.sql("select * from hive_table")

In the above case, does the actual SQL run in the Spark framework, or does it run in Hive's MapReduce framework?

I am just wondering how the SQL is being processed: in Hive or in Spark?

Upvotes: 1

Views: 2571

Answers (2)

Alper t. Turker

Reputation: 35249

enableHiveSupport() and HiveContext are quite misleading, as they suggest some deeper relationship with Hive.

In practice, Hive support means that Spark will use the Hive metastore to read and write metadata. Before 2.0 there were some additional benefits (window function support, a better parser), but this is no longer the case today.

Hive support does not imply:

  • Full Hive Query Language compatibility.
  • Any form of computation on Hive (see the sketch after this list).
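
You can verify the second point from PySpark itself: ask Spark for the query plan and you will only see Spark physical operators (e.g. HiveTableScan or FileScan), never a MapReduce or Tez stage. A minimal sketch, assuming a Hive-enabled session and that hive_table exists in the metastore:

from pyspark.sql import SparkSession

# The Hive metastore is consulted only for metadata: schema, location, format.
spark = (SparkSession.builder
         .appName("plan-check")
         .enableHiveSupport()
         .getOrCreate())

df = spark.sql("select * from hive_table")

# Prints the parsed, analyzed, optimized and physical plans; the
# physical plan is built entirely from Spark operators.
df.explain(True)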

Upvotes: 5

Jagrut Sharma

Reputation: 4754

Spark SQL allows reading data from and writing data to Hive tables. In addition to Hive data, any RDD can be converted to a DataFrame, and Spark SQL can be used to run queries on the DataFrame, as sketched below.
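
For example, a minimal sketch of the RDD-to-DataFrame path; the people/name/age names here are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-to-df").getOrCreate()

# A small in-memory RDD of (name, age) tuples.
rdd = spark.sparkContext.parallelize([("alice", 30), ("bob", 25)])
df = rdd.toDF(["name", "age"])

# Register a temporary view so Spark SQL can query the DataFrame.
df.createOrReplaceTempView("people")
spark.sql("select name from people where age > 26").show()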

The actual execution will happen on Spark. You can check this in your example by running DF.count() and tracking the job via the Spark UI at http://localhost:4040 (a short sketch follows).
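
A minimal sketch of that check, assuming the session from the question is active and hive_table exists:

# count() is an action, so it submits an actual Spark job.
DF = spark.sql("select * from hive_table")
print(DF.count())

# While the application runs, the job shows up under the Jobs tab of
# the Spark UI at http://localhost:4040, as Spark stages and tasks
# rather than Hive MapReduce jobs.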

Upvotes: 1
