Vijayant

Reputation: 742

Spark DataFrame partitioner is None

[New to Spark] After creating a DataFrame, I am trying to partition it based on one of its columns. When I check the partitioner using data_frame.rdd.partitioner, I get None as output.

Partitioning using:

data_frame.repartition("column_name")
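
For reference, a minimal spark-shell reproduction of what I see (assuming a SparkSession named spark; column_name is just a placeholder):

scala> val data_frame = spark.range(10).toDF("column_name").repartition($"column_name")
data_frame: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [column_name: bigint]

scala> data_frame.rdd.partitioner
res0: Option[org.apache.spark.Partitioner] = None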

As per the Spark documentation, the default partitioner is HashPartitioner. How can I confirm that?

Also, how can I change the partitioner?

Upvotes: 2

Views: 964

Answers (1)

10465355

Reputation: 4621

That's to be expected. An RDD converted from a Dataset doesn't preserve the partitioner, only the data distribution.

If you want to inspect the partitioner of the underlying RDD, you should retrieve it from the queryExecution:

scala> val df = spark.range(100).select($"id" % 3 as "id").repartition(42, $"id")
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint]

scala> df.queryExecution.toRdd.partitioner
res1: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.sql.execution.CoalescedPartitioner@4be2340e)
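
Compare that with the public RDD conversion for the same Dataset in the same session, which drops the partitioner:

scala> df.rdd.partitioner
res2: Option[org.apache.spark.Partitioner] = None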

how can I change the partitioner?

In general you cannot. There is a repartitionByRange method (see the linked thread), but otherwise the Dataset partitioner is not configurable.
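
For example, a minimal range-repartitioning sketch (the partition count and the val name ranged are just illustrative):

scala> val ranged = spark.range(100).repartitionByRange(10, $"id")
ranged: org.apache.spark.sql.Dataset[java.lang.Long] = [id: bigint]

Each output partition then holds a contiguous range of id values, but you still cannot plug in a custom Partitioner the way you can with pair RDDs via RDD.partitionBy.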

Upvotes: 2
