Reputation: 611
This is my example.
val arr = Array((1,2), (1,3), (1,4), (2,3), (4,5))
val data = sc.parallelize(arr, 5)
data.glom.map(_.length).collect
Array[Int] = Array(1, 1, 1, 1, 1)
val agg = data.reduceByKey(_+_)
agg.glom.map(_.length).collect
Array[Int] = Array(0, 1, 1, 0, 1)
val fil = agg.filter(_._2 < 4)
fil.glom.map(_.length).collect
Array[Int] = Array(0, 0, 1, 0, 0)
val sub = data.map{case(x,y) => (x, (x,y))}.subtractByKey(fil).map(_._2)
sub.collect
Array[(Int, Int)] = Array((1,4), (1,3), (1,2), (4,5))
sub.glom.map(_.length).collect
Array[Int] = Array(0, 3, 0, 0, 1)
What I'm wondering is how to evenly distribute data across partitions. The data variable consists of five partitions, with all the data evenly partitioned.
ex)par1: (1,2)
par2: (1,3)
par3: (1,4)
par4: (2,3)
par5: (4,5)
After several transformation operations, only two of the five partitions allocated to the sub variable are used. The sub variable still consists of five partitions, but the data is no longer evenly partitioned.
ex)par1: empty
par2: (1,2),(1,3),(1,4)
par3: empty
par4: empty
par5: (4,5)
If I add another transformation operation to the sub variable, there will be 5 available partitions, but only 2 of them will be used for the operation.
ex)sub.map{case(x,y) => (x, x, (x,y))}
So I want to make use of all available partitions when the data is operated on. I used the repartition method, but it is not cheap.
ex) sub.repartition(5).glom.map(_.length).collect
Array[Int] = Array(0, 1, 1, 2, 0)
So I'm looking for a wise way to utilize as many partitions as possible.
Is there a good way?
Upvotes: 2
Views: 1256
Reputation: 13154
So repartition is definitely the way to go :)
Your example is a little too simple to demonstrate anything, as Spark is built to handle billions of rows - not 5. repartition will not put exactly the same number of rows into each partition, but it will distribute data evenly-ish. Try to redo your example with 1,000,000 rows instead and you will see that the data is indeed distributed evenly after a repartition.
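To see it at scale, here is a minimal sketch (assuming the same spark-shell session, where sc is available) that rebuilds your skew with 1,000,000 rows and then repartitions:
val big = sc.parallelize(1 to 1000000, 5).map(i => (i % 10, i))
// hash-partition by key, then keep only two of the ten keys,
// so the surviving rows pile up in two partitions
val byKey = big.partitionBy(new org.apache.spark.HashPartitioner(5))
val skewed = byKey.filter { case (k, _) => k == 0 || k == 1 }
skewed.glom.map(_.length).collect
// e.g. Array(100000, 100000, 0, 0, 0)
skewed.repartition(5).glom.map(_.length).collect
// roughly 40000 rows in each of the 5 partitions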
Data skew is often a big problem when working with transformations of large amounts of data, and repartitioning your data does come with the cost of additional time as it needs to shuffle data around. Sometimes it is worth taking the penalty though, because it will make the following transformation stages run faster.
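As a rough illustration of that trade-off (a sketch continuing the session above; expensiveStage is a hypothetical stand-in for whatever heavy transformation follows):
import org.apache.spark.rdd.RDD
// simulate a CPU-heavy per-row transformation
def expensiveStage(rdd: RDD[(Int, Int)]) =
  rdd.mapValues(v => (1 to 1000).foldLeft(v)((acc, _) => acc ^ 1))
// time an action end-to-end
def time[A](body: => A): A = {
  val start = System.nanoTime
  val result = body
  println(f"${(System.nanoTime - start) / 1e9}%.2f s")
  result
}
time(expensiveStage(skewed).count)                 // only 2 partitions do the work
time(expensiveStage(skewed.repartition(5)).count)  // pays the shuffle, but all 5 partitions work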
Upvotes: 3