Reputation: 4719
A function is defined to transform an RDD. Therefore, the function is called once for each element in the RDD.
The function needs to call an external web service to look up reference data, passing as a parameter data from the current element in the RDD.
Two questions:
Is there an issue with issuing a web service call within Spark?
The data from the web service needs to be cached. What is the best way to hold (and subsequently reference) the cached data? The simple way would be to hold the cache in a collection with the Scala class which contains the function being passed to the RDD. Would this be efficient, or is there a better approach for caching in Spark?
Thanks
Upvotes: 1
Views: 3079
Reputation: 37852
There isn't really any mechanism for "caching" (in the sense that you mean). Seems like the best approach would be to split this task into two phases:
I'm assuming there would potentially be many records accessing the same lookup key (otherwise "caching" won't be of any value anyway), so performing the external calls for the distinct keys is substantially faster.
How should you implement this?
If you know this set of distinct keys is small enough to fit into your driver machine's memory:
val keys = inputRdd.map(/* get key */).distinct().collect()
use the resulting Map[Key, FetchedValues]
in any transformation on your original RDD - it will be serialized and sent to each worker where you can perform the lookup. For example, assuming the input has records for which the foreignId
field is the lookup key:
val keys = inputRdd.map(record => record.foreignId).distinct().collect()
val lookupTable = keys.map(k => (k, fetchValue(k))).asMap
val withValues = inputRdd.map(record => (record, lookupTable(record.foreignId)))
Alternatively - if this map is large (but still can fit in driver memory), you can broadcast it before you use it in RDD transformation - see Broadcast Variables in Spark's Programming Guide
Otherwise (if this map might be too large) - you'll need to use join
if you want keep data in the cluster, but to still refrain from fetching the same element twice:
val byKeyRdd = inputRdd.keyBy(record => record.foreignId)
val lookupTableRdd = byKeyRdd
.keys()
.distinct()
.map(k => (k, fetchValue(k))) // this time fetchValue is done in cluster - concurrently for different values
val withValues = byKeyRdd.join(lookupTableRdd)
Upvotes: 3