user1052610

Reputation: 4719

Spark reference table

A Spark RDD contains a collection, each element represents a request.

A Scala function will be passed to the RDD, and, for each RDD element, the function will create a modified request.

For each collection element/request, a lookup table needs to be referenced. The lookup table will contain at most 200 rows.

For performance and scalability, how should the lookup table (which is used within the function) be modeled?

  1. Spark Broadcast variable.
  2. Separate Spark RDD.
  3. Scala immutable collection.

Perhaps there is another option I have not considered.

Thanks

Upvotes: 1

Views: 411

Answers (1)

rhernando

Reputation: 1071

It depends on the size of your RDDs, but given that your reference table will have at most about 200 rows, the best option is a broadcast variable.

If you used a separate RDD and joined against it, Spark might repartition the request RDD, causing an unnecessary shuffle.
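A minimal sketch of the broadcast approach. The lookup table, request values, and the `enrich` logic here are all hypothetical placeholders; the point is that `sc.broadcast` ships one read-only copy of the table to each executor, so the `map` can reference it with no join and no shuffle:

```scala
import org.apache.spark.sql.SparkSession

object BroadcastLookupExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("BroadcastLookupExample")
      .master("local[*]") // local mode for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical lookup table (at most ~200 rows): request code -> handler name
    val lookupTable: Map[String, String] =
      Map("GET" -> "readHandler", "PUT" -> "writeHandler", "DEL" -> "deleteHandler")

    // Broadcast once; executors receive a read-only copy
    val lookupBc = sc.broadcast(lookupTable)

    // Hypothetical request RDD
    val requests = sc.parallelize(Seq("GET", "PUT", "GET", "POST"))

    // The function passed to the RDD references the broadcast copy
    // via .value -- no second RDD, no join, no shuffle
    val modified = requests.map { req =>
      val handler = lookupBc.value.getOrElse(req, "defaultHandler")
      s"$req -> $handler"
    }

    modified.collect().foreach(println)
    spark.stop()
  }
}
```

Because the table is small and effectively static, broadcasting it once is far cheaper than shipping it inside the closure for every task, and avoids the repartitioning a join against a second RDD could trigger.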

Upvotes: 0
