Reputation: 1151
Given a Hive table partitioned by some_field (of int type), with data stored as Avro files, I want to query the table using Spark SQL so that the returned DataFrame is already partitioned by some_field (the column used for partitioning).
The query is simply:
SELECT * FROM some_table
By default Spark doesn't do that: the returned data_frame.rdd.partitioner is None.
One way to get the result is via explicit repartitioning after querying, as sketched below, but there is probably a better solution.
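For reference, the explicit repartitioning I have in mind is a minimal sketch like this (run in spark-shell, where spark and $ are in scope; some_table and some_field are the names from above):

// Query the Hive table, then explicitly shuffle so that all rows
// sharing the same some_field value land in the same partition
val df = spark.sql("SELECT * FROM some_table")
val repartitioned = df.repartition($"some_field")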
HDP 2.6, Spark 2.
Thanks.
Upvotes: 3
Views: 901
Reputation: 330093
First of all, you have to distinguish between the partitioning of a Dataset and the partitioning of the converted RDD[Row]. No matter what the execution plan of the former is, the latter won't have a Partitioner:
scala> val df = spark.range(100).repartition(10, $"id")
df: org.apache.spark.sql.Dataset[java.lang.Long] = [id: bigint]

scala> df.rdd.partitioner
res1: Option[org.apache.spark.Partitioner] = None
However, the internal RDD might have a Partitioner:
scala> df.queryExecution.toRdd.partitioner
res2: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.sql.execution.CoalescedPartitioner@5a05e0f3)
This, however, is unlikely to help you here, because as of today (Spark 2.2) the Data Source API is not aware of the physical storage information (with the exception of simple partition pruning). This should change with the new Data Source API; please refer to the JIRA ticket (SPARK-15689) and the design document for details.
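To illustrate the partition pruning mentioned above, here is a minimal sketch (reusing the some_table / some_field names from the question). Filtering on the partition column lets Spark skip non-matching partition directories, even though the result still doesn't expose a Partitioner:

// Filter on the partition column so only matching partitions are scanned
val pruned = spark.table("some_table").where($"some_field" === 1)

// The physical plan should show partition filters on some_field; the exact
// scan node (HiveTableScan vs. FileScan) depends on how the table is read
pruned.explain(true)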
Upvotes: 4