Valentin P.

Reputation: 1151

Spark SQL partition awareness when querying a Hive table

Given a Hive table partitioned by some_field (of int type), with data stored as Avro files, I want to query the table using Spark SQL so that the returned DataFrame is already partitioned by some_field (the column used for partitioning).

The query is simply

SELECT * FROM some_table

By default Spark doesn't do that; the returned data_frame.rdd.partitioner is None.

One way to get this result is via explicit repartitioning after querying, but there is probably a better solution.
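For reference, the workaround looks roughly like this (a sketch, assuming a Hive-enabled SparkSession; table and column names are the ones from this question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .enableHiveSupport()
  .getOrCreate()

// Query the table, then hash-repartition the result by the partition column.
val df = spark.sql("SELECT * FROM some_table").repartition(col("some_field"))

// Note: df.rdd.partitioner is still None afterwards; only the internal
// execution plan knows about the hash partitioning.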

HDP 2.6, Spark 2.

Thanks.

Upvotes: 3

Views: 901

Answers (1)

zero323

Reputation: 330093

First of all, you have to distinguish between the partitioning of a Dataset and the partitioning of the converted RDD[Row]. No matter what the execution plan of the former is, the latter won't have a Partitioner:

scala> val df = spark.range(100).repartition(10, $"id")
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.rdd.partitioner
res1: Option[org.apache.spark.Partitioner] = None

However, the internal RDD might have a Partitioner:

scala> df.queryExecution.toRdd.partitioner
res2: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.sql.execution.CoalescedPartitioner@5a05e0f3)

This, however, is unlikely to help you here because, as of today (Spark 2.2), the Data Source API is not aware of the physical storage information (with the exception of simple partition pruning). This should change with the upcoming Data Source API v2. Please refer to the JIRA ticket (SPARK-15689) and the design document for details.
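If you genuinely need an RDD[Row] with a Partitioner today, one option is to key the rows by the partition column and shuffle explicitly. This is a sketch, not a substitute for storage-aware planning; the partition count of 10 is an arbitrary illustration:

import org.apache.spark.HashPartitioner

val keyed = spark.sql("SELECT * FROM some_table")
  .rdd
  .keyBy(row => row.getAs[Int]("some_field"))  // key each Row by some_field
  .partitionBy(new HashPartitioner(10))        // shuffle into 10 hash partitions

keyed.partitioner  // Some(org.apache.spark.HashPartitioner@...)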

Upvotes: 4
