Reputation: 1
My dataset is made up of data points, each of which is a 5000-element array of Doubles, and each data point has a clusterId assigned to it.
For the purposes of the problem I am solving, I need to aggregate those arrays (element-wise) per clusterId and then do a dot product calculation between each data point and its respective aggregate cluster array.
The total number of data points I am dealing with is 4.8 million, and they are split across ~50k clusters.
I use 'reduceByKey' to get the aggregated arrays per clusterId (which is my key) - that step is sketched below. Using this dataset, I have two distinct options:
1. Join the aggregated arrays back to the original dataset on clusterId and compute the dot products in a distributed way.
2. Collect the aggregates to the driver, turn them into a Map, and broadcast that Map to every executor.
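Roughly, the aggregation step looks like this (a simplified sketch; the function name aggregateByCluster and the Long key type are just illustrative):

```scala
import org.apache.spark.rdd.RDD

// Element-wise sum of the per-point arrays within each cluster.
// Assumes pairs of (clusterId, 5000-element Array[Double]); names are illustrative.
def aggregateByCluster(points: RDD[(Long, Array[Double])]): RDD[(Long, Array[Double])] =
  points.reduceByKey { (a, b) =>
    val out = new Array[Double](a.length)
    var i = 0
    while (i < a.length) { out(i) = a(i) + b(i); i += 1 }
    out
  }
```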
My understanding is that a join causes re-partitioning based on the join key, and since my key only has ~50k unique values, I expect that to be quite slow.
What I tried is the 2nd approach - I managed to collect the RDD locally and turn it into a Map with clusterId as the key and the 5000-element Array[Double] as the value.
However, when I try to broadcast/serialize this variable into a closure, I get a 'java.lang.OutOfMemoryError: Requested array size exceeds VM limit'.
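Roughly, the broadcast variant I am attempting looks like this (a simplified sketch with illustrative names; whether the map is captured in the closure directly or handed to sc.broadcast, the whole 50k x 5000 structure has to be built on the driver and shipped to every executor):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Collect the per-cluster aggregates onto the driver, broadcast them,
// and compute each point's dot product against its cluster's aggregate.
def dotWithBroadcast(sc: SparkContext,
                     points: RDD[(Long, Array[Double])],
                     aggregates: RDD[(Long, Array[Double])]): RDD[(Long, Double)] = {
  val aggMap = sc.broadcast(aggregates.collectAsMap()) // ~50k x 5000 doubles on the driver
  points.map { case (clusterId, xs) =>
    val agg = aggMap.value(clusterId)
    var dot = 0.0
    var i = 0
    while (i < xs.length) { dot += xs(i) * agg(i); i += 1 }
    (clusterId, dot)
  }
}
```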
My question is - given the nature of my problem, where I need to make the aggregate data available to each executor, what is the best way to approach this, considering that the aggregate dataset (in my case 50k x 5000 doubles) could be quite large?
Thanks
Upvotes: 0
Views: 682
Reputation: 18434
I highly recommend the join. 50,000 clusters x 5,000 elements x 8 bytes per double is already 2 GB, which is manageable, but it's definitely in the "seriously slow things down, and maybe break some stuff" ballpark.
You are right that repartitioning can sometimes be slow, but I think you are more concerned about it than necessary. It's still an entirely parallel/distributed operation, which makes it essentially infinitely scalable. Collecting things into the driver is not.
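As a rough sketch of the join approach (illustrative names, assuming both the points and the per-cluster aggregates are RDDs of (clusterId, Array[Double]) pairs):

```scala
import org.apache.spark.rdd.RDD

// Join each point with its cluster's aggregate and compute the dot product,
// keeping everything distributed.
def dotWithJoin(points: RDD[(Long, Array[Double])],
                aggregates: RDD[(Long, Array[Double])]): RDD[(Long, Double)] =
  points.join(aggregates).mapValues { case (xs, agg) =>
    var dot = 0.0
    var i = 0
    while (i < xs.length) { dot += xs(i) * agg(i); i += 1 }
    dot
  }
```

The shuffle moves each array at most once, and the dot products are computed entirely on the executors, so nothing ever has to fit on the driver.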
Upvotes: 0