Reputation: 13753
from pyspark.sql.types import *
schema = StructType([StructField("type", StringType(), True), StructField("average", IntegerType(), True)])
values = [('A', 19), ('B', 17), ('C', 10)]
df = spark.createDataFrame(values, schema)
parts = df.rdd.getNumPartitions()
print(parts)
The output is 44.
How is Spark creating 44 partitions for a 3-record DataFrame?
import pyspark.sql.functions as F
df.withColumn('p_id', F.spark_partition_id()).show()
Output:
+----+-------+----+
|type|average|p_id|
+----+-------+----+
| A| 19| 14|
| B| 17| 29|
| C| 10| 43|
+----+-------+----+
Upvotes: 1
Views: 539
Reputation: 1724
When a Dataset/DataFrame is created from a local collection, Spark does take the number of rows into account. Eventually it comes down to LocalTableScanExec; look here:
numParallelism: Int = math.min(math.max(unsafeRows.length, 1), sqlContext.sparkContext.defaultParallelism)
rdd = sqlContext.sparkContext.parallelize(unsafeRows, numParallelism)
Where unsafeRows.length equals the size of the provided collection.
Also, look at this answer for several related settings.
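To see which physical node a given plan actually uses (and therefore which code path picks the partition count), you can inspect the plan and the context's default parallelism. This is just a generic diagnostic sketch using the df and spark session from the question, not tied to a particular Spark version:

# Using the 3-row df from the question.
# The physical plan shows the scan node for a DataFrame built from a local
# collection (e.g. LocalTableScan or Scan ExistingRDD, depending on the API used).
df.explain()

# defaultParallelism is the upper bound in the numParallelism formula above;
# it is evidently 44 in the asker's environment.
print(spark.sparkContext.defaultParallelism)
print(df.rdd.getNumPartitions())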
Upvotes: 1
Reputation: 1339
Because Spark initially creates N partitions regardless of the data. For example, I've run Spark locally with "local[4]" and created a DataFrame from 2 rows;
df.rdd.getNumPartitions()
would return 4, because there are 4 cores available for the Spark job.
If I then do
df.repartition(2).rdd.getNumPartitions()
the result would be 2.
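To tie this back to the spark_partition_id() output in the question, here is a small PySpark sketch (local[4] is an assumed master; the data mirrors the question) showing the partition count before and after repartition:

# Assumed local[4] session for illustration.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[4]").getOrCreate()
df = spark.createDataFrame([('A', 19), ('B', 17), ('C', 10)], ['type', 'average'])

print(df.rdd.getNumPartitions())    # 4 here: one partition per core, most of them empty

df2 = df.repartition(2)             # full shuffle into exactly 2 partitions
print(df2.rdd.getNumPartitions())   # 2

# partition ids now fall in the 0..1 range instead of being spread over all cores
df2.withColumn('p_id', F.spark_partition_id()).show()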
Upvotes: 0