Reputation: 611
This is my example.
val arr = Array((1,2), (1,3), (1,4), (2,3), (4,5))
val data = sc.parallelize(arr, 5)
data.glom.map(_.length).collect
Array[Int] = Array(1, 1, 1, 1, 1)
val agg = data.reduceByKey(_+_)
agg.glom.map(_.length).collect
Array[Int] = Array(0, 1, 1, 0, 1)
val fil = agg.filter(_._2 < 4)
fil.glom.map(_.length).collect
Array[Int] = Array(0, 0, 1, 0, 0)
val sub = data.map{case(x,y) => (x, (x,y))}.subtractByKey(fil).map(_._2)
sub.collect
Array[(Int, Int)] = Array((1,4), (1,3), (1,2), (4,5))
sub.glom.map(_.length).collect
Array[Int] = Array(0, 3, 0, 0, 1)
What I'm wondering is how to evenly distribute data across partitions. The data variable consists of five partitions, with all the data evenly partitioned.
ex)par1: (1,2)
par2: (1,3)
par3: (1,4)
par4: (2,3)
par5: (4,5)
After several transformation operations, only two of the five partitions allocated to the sub variable are used. The sub variable still consists of five partitions, but the data is no longer evenly partitioned.
ex)par1: empty
par2: (1,2),(1,3),(1,4)
par3: empty
par4: empty
par5: (4,5)
If I add another transformation operation to the sub variable, there will be 5 available partitions, but only 2 of them will be used for the operation.
ex)sub.map{case(x,y) => (x, x, (x,y))}
So I want to make use of all available partitions when the data is operated on. I used the repartition method, but it is not cheap.
ex) sub.repartition(5).glom.map(_.length).collect
Array[Int] = Array(0, 1, 1, 2, 0)
So I'm looking for a wise way to utilize as many partitions as possible.
Is there a good way?
Upvotes: 2
Views: 1256
Reputation: 13154
So repartition is definitely the way to go :)
Your example is a little too simple to demonstrate anything, as Spark is built to handle billions of rows - not 5. repartition will not put exactly the same number of rows into each partition, but it will distribute data evenly-ish. Try to redo your example with 1,000,000 rows instead and you will see that the data is indeed distributed evenly after a repartition.
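To see it at scale, here is a minimal sketch (assuming the same spark-shell session, where sc is available) that rebuilds your skew with 1,000,000 rows and then repartitions:
val big = sc.parallelize(1 to 1000000, 5).map(i => (i % 10, i))
// hash-partition by key, then keep only two of the ten keys,
// so the surviving rows pile up in two partitions
val byKey = big.partitionBy(new org.apache.spark.HashPartitioner(5))
val skewed = byKey.filter { case (k, _) => k == 0 || k == 1 }
skewed.glom.map(_.length).collect
// e.g. Array(100000, 100000, 0, 0, 0)
skewed.repartition(5).glom.map(_.length).collect
// roughly 40000 rows in each of the 5 partitions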
Data skew is often a big problem when working with transformations of large amounts of data, and repartitioning your data does come with the cost of additional time as it needs to shuffle data around. Sometimes it is worth taking the penalty though, because it will make the following transformation stages run faster.
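As a rough illustration of that trade-off (a sketch continuing the session above; expensiveStage is a hypothetical stand-in for whatever heavy transformation follows):
import org.apache.spark.rdd.RDD
// simulate a CPU-heavy per-row transformation
def expensiveStage(rdd: RDD[(Int, Int)]) =
  rdd.mapValues(v => (1 to 1000).foldLeft(v)((acc, _) => acc ^ 1))
// time an action end-to-end
def time[A](body: => A): A = {
  val start = System.nanoTime
  val result = body
  println(f"${(System.nanoTime - start) / 1e9}%.2f s")
  result
}
time(expensiveStage(skewed).count)                 // only 2 partitions do the work
time(expensiveStage(skewed.repartition(5)).count)  // pays the shuffle, but all 5 partitions work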
Upvotes: 3