Dev

Reputation: 13753

Clarity on number of partitions in spark dataframe

from pyspark.sql.types import *

# `spark` is the SparkSession provided by the pyspark shell / notebook
schema = StructType([StructField("type", StringType(), True), StructField("average", IntegerType(), True)])
values = [('A', 19), ('B', 17), ('C', 10)]
df = spark.createDataFrame(values, schema)

parts = df.rdd.getNumPartitions()

print(parts)

Output is 44

How is Spark creating 44 partitions for a DataFrame with only 3 records?

import pyspark.sql.functions as F
df.withColumn('p_id', F.spark_partition_id()).show()

Output:

+----+-------+----+
|type|average|p_id|
+----+-------+----+
|   A|     19|  14|
|   B|     17|  29|
|   C|     10|  43|
+----+-------+----+

Upvotes: 1

Views: 539

Answers (2)

Gelerion

Reputation: 1724

When a Dataset/DataFrame is created from a local collection, Spark does take the number of rows into account. Eventually it comes down to LocalTableScanExec; look here:

numParallelism: Int = math.min(math.max(unsafeRows.length, 1), sqlContext.sparkContext.defaultParallelism)
rdd = sqlContext.sparkContext.parallelize(unsafeRows, numParallelism)

where unsafeRows.length equals the size of the provided collection.
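
As a rough sketch, the same sizing rule can be written out in PySpark terms (this assumes, as the answer does, that the DataFrame is planned as a LocalTableScanExec; spark, df and values are the objects from the question):

# Hedged sketch: the sizing rule above, written out in Python with
# illustrative inputs rather than Spark internals.
default_parallelism = spark.sparkContext.defaultParallelism  # upper bound of the formula
collection_size = len(values)                                # plays the role of unsafeRows.length, i.e. 3 here

num_parallelism = min(max(collection_size, 1), default_parallelism)
print(num_parallelism)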

Also, look at this answer for several related settings.

Upvotes: 1

Serge Harnyk

Reputation: 1339

Because Spark initially creates N partitions regardless of the data. For example, if I run Spark locally with "local[4]" and create a DataFrame from 2 rows, df.rdd.getNumPartitions() returns 4, because there are 4 cores available to the Spark job.

If I then do the following:

df.repartition(2).rdd.getNumPartitions()

the result would be 2.
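
A minimal sketch of that scenario, assuming a fresh local session (the master URL, app name and the 2-row DataFrame are just the illustrative values from this example):

from pyspark.sql import SparkSession

# Local session with 4 cores, so defaultParallelism is 4
spark = SparkSession.builder.master("local[4]").appName("partition-demo").getOrCreate()

df = spark.createDataFrame([("A", 1), ("B", 2)], ["type", "average"])

print(df.rdd.getNumPartitions())                 # 4: one partition per core, regardless of the 2 rows
print(df.repartition(2).rdd.getNumPartitions())  # 2 after the explicit repartition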

Upvotes: 0
