Ahmed Gamal

Reputation: 31

What is the best way to read Hive Table through Spark SQL?

I run Spark SQL queries that read from Hive tables, and execution is slow (around 15 minutes). I want to optimize query execution, so my question is: does executing these queries use Hive's execution engine, making it equivalent to running them in the Hive editor, or does Spark use the Hive Metastore only to find the locations of the files and then read the files directly?

import os
import findspark
findspark.init()
from pyspark.sql import SparkSession

# Build a SparkSession on YARN with dynamic allocation enabled
spark = SparkSession.builder \
    .master("yarn") \
    .appName("src_count") \
    .config('spark.executor.cores', '5') \
    .config('spark.executor.memory', '29g') \
    .config('spark.driver.memory', '16g') \
    .config('spark.driver.maxResultSize', '12g') \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.shuffle.service.enabled", "true") \
    .getOrCreate()

sql = "SELECT S.SERVICE, \
       COUNT(DISTINCT CONTRACT_KEY) DISTINCT_CNT, \
       COUNT(*) CNT ... "

# Run the query against the Hive table, then pull the result to the driver
df = spark.sql(sql)
df.toPandas()

Upvotes: 2

Views: 7977

Answers (1)

Neeraj Bhadani

Reputation: 3110

You can read the HIVE table as follows:

  1. Read the entire HIVE table:

df = spark.table("<HIVE_DB>.<HIVE_TBL>")

  2. Read part of the table based on a SQL query:

df = spark.sql("<YOUR_SQL_QUERY>")
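For example, a minimal sketch of both approaches; the database name, table name, and columns here are hypothetical, and the session needs Hive support if it is not already configured against the Hive Metastore:

from pyspark.sql import SparkSession

# Hive-enabled session (enableHiveSupport is only needed when the session
# is not already wired to the Hive Metastore by the cluster config)
spark = SparkSession.builder \
    .appName("hive_read_example") \
    .enableHiveSupport() \
    .getOrCreate()

# 1. Load the whole table (hypothetical names my_db.sales)
full_df = spark.table("my_db.sales")

# 2. Load only what the query needs; Spark can prune columns and partitions
partial_df = spark.sql(
    "SELECT service, contract_key FROM my_db.sales WHERE load_date = '2019-01-01'"
)
partial_df.show(5)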

Also, in your question you are converting the Spark DataFrame to a pandas DataFrame, which is not recommended. In that case you send all the data from the workers to the driver, which transfers a lot of data across the network and slows down the application; the driver also has to hold the entire dataset, so it can become overloaded and may go OOM as well.
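As an illustration, a rough sketch of keeping the heavy work in Spark and only collecting a small, aggregated result (it continues from the session above; the table and column names are hypothetical):

# Aggregate on the cluster; only the small result set crosses the network
agg_df = spark.sql("""
    SELECT service,
           COUNT(DISTINCT contract_key) AS distinct_cnt,
           COUNT(*)                     AS cnt
    FROM my_db.sales
    GROUP BY service
""")

# Now toPandas() is cheap: one row per service instead of the full table
summary_pdf = agg_df.toPandas()

# For results that are still large, write them out instead of collecting
agg_df.write.mode("overwrite").parquet("/tmp/service_counts")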

Upvotes: 3
