Aviral Srivastava

Reputation: 4582

Why is it not possible to determine the partitions in a dataframe if it is possible to get the count of partitions in Spark?

Using df.rdd.getNumPartitions(), we can get the count of partitions. But how do we get the partitions?

I also tried to find something in the documentation and among the DataFrame's attributes (using dir(df)). However, I could not find any API that returns the partitions themselves; repartition, coalesce and getNumPartitions were all I could find.

I read this and deduced that Spark does not know the partitioning key(s). My question is: if it does not know the partitioning key(s), and hence does not know the partitions, how can it know their count? And if it can, how can the partitions themselves be determined?
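A minimal sketch of what I mean, assuming an existing SparkSession named spark:

# Minimal sketch, assuming an existing SparkSession named `spark`
df = spark.range(0, 100)

# The number of partitions is always available...
print(df.rdd.getNumPartitions())   # e.g. 8, depending on spark.default.parallelism

# ...but the underlying RDD typically has no partitioner set,
# so Spark cannot say which key ends up in which partition.
print(df.rdd.partitioner)          # None for a plain DataFrame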

Upvotes: 0

Views: 693

Answers (2)

mazaneicha

Reputation: 9417

pyspark provides the spark_partition_id() function.

spark_partition_id()

A column for partition ID.

Note: This is non-deterministic because it depends on data partitioning and task scheduling.

>>> from pyspark.sql.functions import spark_partition_id
>>> spark.range(1, 1000000) \
...       .withColumn("spark_partition", spark_partition_id()) \
...       .groupby("spark_partition") \
...       .count().show(truncate=False)
+---------------+------+
|spark_partition|count |
+---------------+------+
|1              |500000|
|0              |499999|
+---------------+------+

Partitions are numbered from zero to n-1 where n is the number you get from getNumPartitions().
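As a rough sanity check (a sketch assuming the same spark session), the distinct spark_partition_id() values line up with getNumPartitions():

from pyspark.sql.functions import spark_partition_id

df = spark.range(1, 1000000)
ids = [row[0] for row in df.select(spark_partition_id()).distinct().collect()]

# Partition IDs should be exactly 0 .. n-1
print(sorted(ids))                  # e.g. [0, 1]
print(df.rdd.getNumPartitions())    # e.g. 2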

Is that what you're after? Or did you actually mean Hive partitions?
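If Hive (storage) partitions were meant, those can be listed from the catalog instead; a sketch assuming a hypothetical partitioned table my_db.my_table:

# `my_db.my_table` is a hypothetical Hive-partitioned table
spark.sql("SHOW PARTITIONS my_db.my_table").show(truncate=False)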

Upvotes: 1

luk

Reputation: 105

How about checking what each partition contains using mapPartitionsWithIndex?

This code will work for a small dataset, since it materializes each partition's contents as a single string:

def f(splitIndex, elements):
    # Join this partition's elements into one comma-separated string
    elements_text = ",".join(str(e) for e in elements)
    yield splitIndex, elements_text

rdd.mapPartitionsWithIndex(f).take(10)
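For example, with a small hypothetical RDD built in the same session:

# Hypothetical example: 10 integers spread over 3 partitions
rdd = spark.sparkContext.parallelize(range(10), 3)
rdd.mapPartitionsWithIndex(f).take(10)
# e.g. [(0, '0,1,2'), (1, '3,4,5'), (2, '6,7,8,9')]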

Upvotes: 2
