user1052610

Reputation: 4719

Spark reference table

A Spark RDD contains a collection, each element represents a request.

A Scala function will be passed to the RDD, and, for each RDD element, the function will create a modified request.

For each collection element/request, a lookup table needs to be referenced. The lookup table will contain at most 200 rows.

For performance and scalability, how should the lookup table (which is used within the function) be modeled?

  1. Spark Broadcast variable.
  2. Separate Spark RDD.
  3. Scala immutable collection.

Perhaps there is another option I have not considered.

Thanks

Upvotes: 1

Views: 411

Answers (1)

rhernando

Reputation: 1071

It depends on the size of your RDDs, but given that your reference table will have at most about 200 rows, the best option is a broadcast variable.

If you used a separate RDD and joined against it, Spark might repartition the request RDD, causing an unnecessary shuffle.
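A minimal sketch of the broadcast approach. The lookup table, request values, and the `enrich` logic here are all hypothetical placeholders; the point is that `sc.broadcast` ships one read-only copy of the table to each executor, so the `map` can reference it with no join and no shuffle:

```scala
import org.apache.spark.sql.SparkSession

object BroadcastLookupExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("BroadcastLookupExample")
      .master("local[*]") // local mode for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical lookup table (at most ~200 rows): request code -> handler name
    val lookupTable: Map[String, String] =
      Map("GET" -> "readHandler", "PUT" -> "writeHandler", "DEL" -> "deleteHandler")

    // Broadcast once; executors receive a read-only copy
    val lookupBc = sc.broadcast(lookupTable)

    // Hypothetical request RDD
    val requests = sc.parallelize(Seq("GET", "PUT", "GET", "POST"))

    // The function passed to the RDD references the broadcast copy
    // via .value -- no second RDD, no join, no shuffle
    val modified = requests.map { req =>
      val handler = lookupBc.value.getOrElse(req, "defaultHandler")
      s"$req -> $handler"
    }

    modified.collect().foreach(println)
    spark.stop()
  }
}
```

Because the table is small and effectively static, broadcasting it once is far cheaper than shipping it inside the closure for every task, and avoids the repartitioning a join against a second RDD could trigger.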

Upvotes: 0
