Reputation: 4719
A Spark RDD contains a collection in which each element represents a request.
A Scala function will be passed to the RDD, and, for each RDD element, the function will create a modified request.
For each collection element/request, a lookup table needs to be referenced. The maximum size of the reference table will be 200 rows.
For performance and scalability, how should the lookup table (which is used within the function) be modeled?
Perhaps there is another option I have not considered.
Thanks
Upvotes: 1
Views: 411
Reputation: 1071
It depends on the size of your RDDs, but given that your reference table will have only about 200 rows, I think the best option would be to use a broadcast variable.
If you used a separate RDD instead, you could force Spark to repartition the request RDD and trigger an unnecessary shuffle.
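As a minimal sketch of the broadcast approach (the lookup keys, values, and request type here are hypothetical placeholders, since the question does not show the actual schema):

```scala
import org.apache.spark.sql.SparkSession

object BroadcastLookupExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-lookup")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical lookup table: request key -> replacement value.
    // At ~200 rows it is tiny, so broadcasting it is cheap.
    val lookupTable: Map[String, String] = Map("a" -> "A", "b" -> "B")

    // Broadcast once; each executor receives a read-only copy,
    // instead of the map being reserialized with every task closure.
    val lookupBc = sc.broadcast(lookupTable)

    val requests = sc.parallelize(Seq("a", "b", "c"))

    // Modify each request by consulting the broadcast table
    // inside the mapping function; no shuffle is needed.
    val modified = requests.map { req =>
      lookupBc.value.getOrElse(req, req)
    }

    modified.collect().foreach(println)
    spark.stop()
  }
}
```

Because the map lives in `lookupBc.value` on each executor, every element lookup is a local hash-map access, which keeps the per-request work O(1) and avoids the join/shuffle you would get with a second RDD.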
Upvotes: 0