Reputation: 4582
Using df.rdd.getNumPartitions(), we can get the count of partitions. But how do we get the partitions themselves?
I also tried to find something in the documentation and in the DataFrame's attributes (using dir(df)). However, I could not find any API that returns the partitions; repartition, coalesce, and getNumPartitions were all I could find.
I read this and deduced that Spark does not know the partitioning key(s). My question is: if it does not know the partitioning key(s), and hence does not know the partitions, how can it know their count? And if it can, how do I determine the partitions?
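For reference, a minimal sketch of this inspection (df stands for any DataFrame):
# df is any DataFrame (placeholder); this only shows what I could find, not the partitions themselves
print(df.rdd.getNumPartitions())                          # partition count
print([a for a in dir(df) if "partition" in a.lower()])   # e.g. ['foreachPartition', 'repartition', ...]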
Upvotes: 0
Views: 693
Reputation: 9417
PySpark provides the spark_partition_id() function.
spark_partition_id()
A column for partition ID.
Note: This is indeterministic because it depends on data partitioning and task scheduling.
>>> from pyspark.sql.functions import spark_partition_id
>>> (spark.range(1, 1000000)
...     .withColumn("spark_partition", spark_partition_id())
...     .groupby("spark_partition")
...     .count()
...     .show(truncate=False))
+---------------+------+
|spark_partition|count |
+---------------+------+
|1 |500000|
|0 |499999|
+---------------+------+
Partitions are numbered from zero to n-1, where n is the number you get from getNumPartitions().
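As a quick sanity check (a sketch, assuming the same range DataFrame as above and that no partition is empty), the distinct spark_partition_id() values line up with that count:
>>> df = spark.range(1, 1000000)
>>> n = df.rdd.getNumPartitions()
>>> sorted(r[0] for r in df.select(spark_partition_id()).distinct().collect()) == list(range(n))  # IDs run 0 .. n-1
True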
Is that what you're after? Or did you actually mean Hive partitions?
Upvotes: 1
Reputation: 105
How about checking what each partition contains using mapPartitionsWithIndex?
This code works for small datasets (it materializes each partition's contents as a string):
def f(splitIndex, elements):
    # Join the partition's elements into one comma-separated string
    elements_text = ",".join(str(e) for e in elements)
    yield splitIndex, elements_text

rdd.mapPartitionsWithIndex(f).take(10)
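For example (a sketch; the explicit numPartitions is only there to make the split visible, and the exact grouping may differ on your cluster):
# Build a small RDD of plain integers and print each partition's index and contents
rdd = spark.range(1, 21, numPartitions=4).rdd.map(lambda row: row.id)
print(rdd.mapPartitionsWithIndex(f).collect())
# roughly [(0, '1,2,3,4,5'), (1, '6,7,8,9,10'), (2, '11,12,13,14,15'), (3, '16,17,18,19,20')]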
Upvotes: 2