Reputation: 2251
In Apache Spark, repartition(n) allows partitioning the RDD into exactly n partitions.
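For example (a minimal sketch, assuming an existing SparkContext sc; repartition only fixes the number of partitions, not which elements end up where):

data = sc.parallelize(range(10))
data.repartition(4).getNumPartitions()   # 4, but element placement is not controlled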
But how can the given RDD be partitioned so that all partitions (except possibly the last) have a specified number of elements, given that the number of elements in the RDD is not known and calling .count() is expensive?
C = sc.parallelize([x for x in range(10)],2)
Let's say internally, C = [[0,1,2,3,4,5], [6,7,8,9]]
C = someCode(3)
Expected:
C = [[0,1,2], [3,4,5], [6, 7, 8], [9]]
Upvotes: 0
Views: 833
Reputation: 212
This is quite easy in pyspark:
C = sc.parallelize([x for x in range(10)], 2)
# partitionBy works on pair RDDs, so pair each element with itself as the key
rdd = C.map(lambda x: (x, x))
# custom partition function: int(x * 4 / 11) sends keys 0..9 to buckets 0..3, three keys per bucket
C_repartitioned = rdd.partitionBy(4, lambda x: int(x * 4 / 11)).map(lambda x: x[0]).glom().collect()
C_repartitioned
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
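The partition function int(x * 4 / 11) simply buckets the consecutive keys 0..9 into 4 ranges of at most 3, so it bakes in knowledge of the element count. A more general sketch of the same trick, assuming the keys are the integers 0..n-1 and that n is known (chunk_size, n and num_chunks are names introduced here just for illustration):

chunk_size = 3
n = 10                                  # total number of elements (known in this toy example)
num_chunks = -(-n // chunk_size)        # ceiling division -> 4 chunks
pairs = sc.parallelize(range(n), 2).map(lambda x: (x, x))
chunks = (pairs.partitionBy(num_chunks, lambda k: k // chunk_size)
               .map(lambda kv: kv[0])
               .glom()
               .collect())
# chunks == [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]

Note that this still needs n to size the number of partitions, which is exactly what the question tries to avoid; for arbitrary elements you would also first have to attach a consecutive index (e.g. with zipWithIndex(), which itself triggers a job).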
This technique is called custom partitioning. More on it:
http://sparkdatasourceapi.blogspot.ru/2016/10/patitioning-in-spark-writing-custom.html
http://baahu.in/spark-custom-partitioner-java-example/
Upvotes: 0