Genarito

Reputation: 3433

Why does Spark's RDD partitionBy method take both a number of partitions and a partition function?

The partitionBy method's signature is RDD.partitionBy(numPartitions, partitionFunc=<function portable_hash>). Why are there two parameters? For example, if I had the following code:

from random import randint

rdd.partitionBy(4, partitionFunc=lambda _key: randint(0, 100))

How many partitions will be created? 4, as the first parameter says? Or as many partitions as there are distinct values returned by partitionFunc? If the first is correct, what is the point of the second parameter? The Spark documentation is not clear about either parameter anywhere in the RDD API docs...

Upvotes: 0

Views: 316

Answers (1)

Emma

Reputation: 9308

Essentially, the first argument is how many partitions the data is divided into, and the second argument decides which of those partitions each record goes to: Spark computes the target partition as partitionFunc(key) % numPartitions, so the partition count is always fixed by the first argument.

Here is the short demo.

df = spark.createDataFrame([
  [1, "aaaaaa"],
  [2, "bbbbbb"],
  [2, "cccccc"],
  [3, "dddddd"]
], ['num', 'text'])

# partitionBy requires a pair (key, value) RDD, so convert each Row to a
# (key, value) tuple; x[1:] leaves the value as a 1-tuple like ('aaaaaa',).
pair = df.rdd.map(lambda x: (x[0], x[1:]))
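
Before running the cases below, you can predict where each key will land: by default Spark computes the target partition as portable_hash(key) % numPartitions. Here is a minimal sketch of that rule, assuming the default portable_hash (importable from pyspark.rdd):

from pyspark.rdd import portable_hash

# partition index = portable_hash(key) % numPartitions
for key in [1, 2, 3]:
    print(key, portable_hash(key) % 5)
# 1 1
# 2 2
# 3 3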

Case 1: Partition into 5, with no explicit partition function. Spark will try to spread the data across 5 partitions, but with fewer distinct keys than partitions, some partitions end up empty.

pair.partitionBy(5).glom().collect()

[[],
 [(1, ('aaaaaa',))],
 [(2, ('bbbbbb',)), (2, ('cccccc',))],
 [(3, ('dddddd',))],
 []]

In this case, partitions 0 and 4 have no data, while partitions 1-3 each hold the records whose keys hash to them.
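
If you prefer to see partition indices directly instead of counting positions in the glom() output, here is a quick sketch with mapPartitionsWithIndex on the same pair RDD:

# Tag each record with the index of the partition it lives in;
# empty partitions simply contribute nothing.
pair.partitionBy(5) \
    .mapPartitionsWithIndex(lambda idx, it: ((idx, rec) for rec in it)) \
    .collect()

# [(1, (1, ('aaaaaa',))), (2, (2, ('bbbbbb',))), (2, (2, ('cccccc',))), (3, (3, ('dddddd',)))]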

Case 2: Partition into 5, but force every record into partition 1.

pair.partitionBy(5, lambda _: 1).glom().collect()

[[],
 [(1, ('aaaaaa',)), (2, ('bbbbbb',)), (2, ('cccccc',)), (3, ('dddddd',))],
 [],
 [],
 []]

This time all the data lands in partition 1, while partitions 0, 2, 3, and 4 are empty.
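
Coming back to the randint example in the question: numPartitions always wins. Whatever integer partitionFunc returns is reduced modulo numPartitions, so you get exactly 4 partitions no matter how many distinct values the function produces. A sketch, assuming random.randint:

from random import randint

# The function's return value is taken modulo numPartitions,
# so exactly 4 partitions exist regardless of what randint returns.
pair.partitionBy(4, lambda _key: randint(0, 100)).getNumPartitions()
# 4

Note that a non-deterministic partition function defeats the purpose of partitionBy, because the same key can land in different partitions each time the RDD is evaluated.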

Upvotes: 1
