Reputation: 1
My dataset is made up of data points, each of which is a 5000-element array of Doubles, and each data point has a clusterId assigned to it.
For the purposes of the problem I am solving, I need to aggregate those arrays (element-wise) per clusterId and then do a dot product calculation between each data point and its respective aggregate cluster array.
The total number of data points I am dealing with is 4.8 million, and they are split across ~50k clusters.
I use 'reduceByKey' to get the aggregated arrays per clusterId (which is my key) - that step is sketched below. Using this dataset, I have two distinct options:
1. Join the aggregated arrays back to the original dataset on clusterId and compute the dot products in a distributed way.
2. Collect the aggregates to the driver, turn them into a Map, and broadcast that Map to every executor.
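Roughly, the aggregation step looks like this (a simplified sketch; the function name aggregateByCluster and the Long key type are just illustrative):

```scala
import org.apache.spark.rdd.RDD

// Element-wise sum of the per-point arrays within each cluster.
// Assumes pairs of (clusterId, 5000-element Array[Double]); names are illustrative.
def aggregateByCluster(points: RDD[(Long, Array[Double])]): RDD[(Long, Array[Double])] =
  points.reduceByKey { (a, b) =>
    val out = new Array[Double](a.length)
    var i = 0
    while (i < a.length) { out(i) = a(i) + b(i); i += 1 }
    out
  }
```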
My understanding is that a join causes re-partitioning based on the join key, and since my key only has ~50k unique values, I expect that to be quite slow.
What I tried is the 2nd approach - I managed to collect the RDD locally and turn it into a Map with clusterId as the key and the 5000-element Array[Double] as the value.
However, when I try to broadcast/serialize this variable into a closure, I get a 'java.lang.OutOfMemoryError: Requested array size exceeds VM limit'.
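Roughly, the broadcast variant I am attempting looks like this (a simplified sketch with illustrative names; whether the map is captured in the closure directly or handed to sc.broadcast, the whole 50k x 5000 structure has to be built on the driver and shipped to every executor):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Collect the per-cluster aggregates onto the driver, broadcast them,
// and compute each point's dot product against its cluster's aggregate.
def dotWithBroadcast(sc: SparkContext,
                     points: RDD[(Long, Array[Double])],
                     aggregates: RDD[(Long, Array[Double])]): RDD[(Long, Double)] = {
  val aggMap = sc.broadcast(aggregates.collectAsMap()) // ~50k x 5000 doubles on the driver
  points.map { case (clusterId, xs) =>
    val agg = aggMap.value(clusterId)
    var dot = 0.0
    var i = 0
    while (i < xs.length) { dot += xs(i) * agg(i); i += 1 }
    (clusterId, dot)
  }
}
```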
My question is - given the nature of my problem, where I need to make the aggregate data available to each executor, what is the best way to approach this, considering that the aggregate dataset (in my case 50k x 5000 doubles) could be quite large?
Thanks
Upvotes: 0
Views: 682
Reputation: 18434
I highly recommend the join. 50,000 clusters x 5,000 elements x 8 bytes per double is already 2 GB, which is manageable, but it's definitely in the "seriously slow things down, and maybe break some stuff" ballpark.
You are right that repartitioning can sometimes be slow, but I think you are more concerned about it than necessary. It's still an entirely parallel/distributed operation, which makes it essentially infinitely scalable. Collecting things into the driver is not.
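As a rough sketch of the join approach (illustrative names, assuming both the points and the per-cluster aggregates are RDDs of (clusterId, Array[Double]) pairs):

```scala
import org.apache.spark.rdd.RDD

// Join each point with its cluster's aggregate and compute the dot product,
// keeping everything distributed.
def dotWithJoin(points: RDD[(Long, Array[Double])],
                aggregates: RDD[(Long, Array[Double])]): RDD[(Long, Double)] =
  points.join(aggregates).mapValues { case (xs, agg) =>
    var dot = 0.0
    var i = 0
    while (i < xs.length) { dot += xs(i) * agg(i); i += 1 }
    dot
  }
```

The shuffle moves each array at most once, and the dot products are computed entirely on the executors, so nothing ever has to fit on the driver.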
Upvotes: 0