Reputation: 3433
The signature of the partitionBy method is RDD.partitionBy(numPartitions, partitionFunc=<function portable_hash>). Why are there two parameters? For example, if I had the following code:
rdd.partitionBy(4, partitionFunc=lambda _key: random.randint(0, 100))
How many partitions will be created? Four, per the first parameter? Or as many partitions as there are random values generated by partitionFunc? If the former is correct, what is the point of the second parameter? The Spark documentation is not clear about either parameter anywhere on the RDD API page...
Upvotes: 0
Views: 316
Reputation: 9308
Essentially, the first argument sets how many partitions the data is divided into, and the second argument decides which of those partitions each record goes to: it maps a record's key to a partition index.
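As a mental model (a sketch of the placement rule, not a copy of Spark's source), a record with key k lands in partition partitionFunc(k) % numPartitions, where the default partitionFunc is pyspark.rdd.portable_hash:

from pyspark.rdd import portable_hash

def which_partition(key, num_partitions, partition_func=portable_hash):
    # The partition function maps the key to an integer; the modulo
    # keeps the result within the valid range 0..num_partitions-1.
    return partition_func(key) % num_partitions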
Here is the short demo.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    [1, "aaaaaa"],
    [2, "bbbbbb"],
    [2, "cccccc"],
    [3, "dddddd"]
], ['num', 'text'])

# partitionBy requires a pairwise RDD, so convert the DataFrame to (key, value) pairs.
pair = df.rdd.map(lambda x: (x[0], x[1]))
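For reference, collecting the pair RDD should show these (key, value) tuples:

pair.collect()
[(1, 'aaaaaa'), (2, 'bbbbbb'), (2, 'cccccc'), (3, 'dddddd')]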
Case 1: Partition into 5, with no explicit partitionFunc. Spark falls back to the default portable_hash and tries to spread the data across 5 partitions, but with fewer distinct keys than partitions, some partitions end up empty.
pair.partitionBy(5).glom().collect()
[[],
 [(1, 'aaaaaa')],
 [(2, 'bbbbbb'), (2, 'cccccc')],
 [(3, 'dddddd')],
 []]
In this case, partitions 0 and 4 hold no data, while partitions 1-3 each hold some.
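Why those particular partitions? With the default partitionFunc, the index is portable_hash(key) % 5, and for small non-negative integers portable_hash(k) is simply k in CPython (an implementation detail, so verify rather than rely on it):

from pyspark.rdd import portable_hash

# Keys 1, 2, 3 hash to themselves, so they land in partitions 1, 2, 3.
[portable_hash(k) % 5 for k in (1, 2, 3)]
# [1, 2, 3]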
Case 2: Partition into 5, but force every record into partition 1.
pair.partitionBy(5, lambda _: 1).glom().collect()
[[],
 [(1, 'aaaaaa'), (2, 'bbbbbb'), (2, 'cccccc'), (3, 'dddddd')],
 [],
 [],
 []]
This time all the data lands in partition 1, and partitions 0, 2, 3 and 4 stay empty.
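Coming back to the random partitionFunc from the question: the number of partitions is still fixed by the first argument; a random function only scatters records among those partitions. A quick check (the bounds on random.randint are an illustrative choice):

import random

rdd2 = pair.partitionBy(4, lambda _key: random.randint(0, 3))
rdd2.getNumPartitions()
# 4 -- always exactly 4 partitions, regardless of what partitionFunc returns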
Upvotes: 1