letsBeePolite

Reputation: 2251

Divide an RDD into partitions with a fixed number of elements in each partition

In Apache Spark,

repartition(n) allows repartitioning the RDD into exactly n partitions, but it gives no control over how many elements land in each one.
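For example, a quick illustration of that limitation:

    rdd = sc.parallelize(range(10), 2)
    rdd.repartition(5).glom().collect()
    # exactly 5 partitions, but the per-partition element counts are
    # not guaranteed; the distribution after the shuffle may vary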

But how can a given RDD be partitioned so that every partition (except possibly the last) holds a specified number of elements, given that the number of elements in the RDD is not known in advance and calling .count() is expensive?

    C = sc.parallelize([x for x in range(10)], 2)
    # let's say internally, C = [[0,1,2,3,4,5], [6,7,8,9]]
    C = someCode(3)  # someCode is the function being asked for

Expected:

    C = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]

Upvotes: 0

Views: 833

Answers (1)

Sergio Alyoshkin

Reputation: 212

This is quite easy in PySpark:

    C = sc.parallelize([x for x in range(10)], 2)
    # partitionBy needs key-value pairs, so key each element by itself
    rdd = C.map(lambda x: (x, x))
    # 4 partitions; int(x * 4 / 11) maps keys 0-2 -> 0, 3-5 -> 1, 6-8 -> 2, 9 -> 3
    C_repartitioned = (rdd.partitionBy(4, lambda x: int(x * 4 / 11))
                          .map(lambda x: x[0]).glom().collect())
    C_repartitioned

    [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]

This is called custom partitioning. More on it here: http://sparkdatasourceapi.blogspot.ru/2016/10/patitioning-in-spark-writing-custom.html

http://baahu.in/spark-custom-partitioner-java-example/
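Note that the divisor 11 and the partition count 4 above are hardcoded for this particular 10-element RDD. Below is a minimal sketch of the same idea generalized to an unknown size (chunk_size and the other names are illustrative); zipWithIndex and count() each trigger an extra pass over the data, so it only partly sidesteps the cost the question worries about:

    from math import ceil

    chunk_size = 3

    # give every element a global, contiguous index (runs one extra Spark job)
    indexed = C.zipWithIndex().map(lambda kv: (kv[1], kv[0]))  # (index, value)

    n = indexed.count()                  # another pass; needed to size partitionBy
    num_parts = ceil(n / chunk_size)

    fixed = (indexed.partitionBy(num_parts, lambda i: i // chunk_size)
                    .map(lambda kv: kv[1]))
    fixed.glom().collect()               # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]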

Upvotes: 0
