Aravind P

Reputation: 65

Performance consideration when reading from a Hive view vs. a Hive table via DataFrames

We have a view that unions multiple Hive tables. If I use Spark SQL in pyspark and read from that view, will there be any performance penalty compared with reading directly from the underlying table? In Hive, a query triggers a full table scan if the WHERE clause does not restrict it to an exact table partition. Is Spark intelligent enough to read only the table and partitions that hold the data we are looking for, rather than scanning the entire view? Please advise.

Upvotes: 1

Views: 2121

Answers (1)

Strick

Reputation: 1642

You are talking about partition pruning. Yes, Spark supports it: Spark automatically skips reading large amounts of data when partition filters are specified in the query.

Partition pruning is possible when the data within a table is split across multiple logical partitions. Each partition corresponds to a particular value of a partition column and is stored as a subdirectory within the table's root directory on HDFS. Where applicable, only the required partitions (subdirectories) of a table are queried, thereby avoiding unnecessary I/O.
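For instance, here is a minimal pyspark sketch (assuming a SparkSession named spark, as in the pyspark shell; the data and path are made up for illustration) that produces exactly this layout:

# Hypothetical data; writing with partitionBy creates one subdirectory
# per distinct value of the partition column, e.g. peoplePartitioned/age=20/.
df = spark.createDataFrame(
    [("Alice", 20), ("Bob", 30), ("Carol", 20)],
    ["name", "age"],
)
df.write.format("orc").partitionBy("age").save("peoplePartitioned")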

After partitioning the data, subsequent queries can avoid large amounts of I/O when the partition column is referenced in predicates. For example, the following query automatically locates and loads only the files under peoplePartitioned/age=20/ and skips all others:

// Read the partitioned ORC data and register it as a temporary view (Scala).
val peoplePartitioned = spark.read.format("orc").load("peoplePartitioned")
peoplePartitioned.createOrReplaceTempView("peoplePartitioned")
// Only the age=20 partition directory is scanned; the others are pruned.
spark.sql("SELECT * FROM peoplePartitioned WHERE age = 20")

More detailed info is provided here.

You can also see this in the logical plan if you run explain(True) on your query:

spark.sql("SELECT * FROM peoplePartitioned WHERE age = 20").explain(True)

It will show which partitions Spark reads; look for the PartitionFilters entry in the physical plan's file scan.
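To verify the exact scenario in the question, you can do the same with a view that unions partitioned tables. Spark expands the view definition into the logical plan at analysis time and pushes the partition filter down through the union, so each underlying table is pruned individually. A sketch, assuming hypothetical Hive tables t1 and t2, both partitioned by dt:

# Hypothetical view over two partitioned tables (all names are illustrative).
spark.sql("CREATE OR REPLACE VIEW v AS SELECT * FROM t1 UNION ALL SELECT * FROM t2")

# The filter on dt is pushed down to both t1 and t2; the physical plan
# should show PartitionFilters with dt = '2020-01-01' for each table scan.
spark.sql("SELECT * FROM v WHERE dt = '2020-01-01'").explain(True)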

Upvotes: 3
