Reputation: 3316
Suppose I have a RDD with the following type:
RDD[(Long, List(Integer))]
Can I assume that the entire list is located at the same worker? I want to know if certain operations are acceptable on the RDD level or should be calculated at driver. For instance:
val data: RDD[(Long, List(Integer))] = someFunction() //creates list for each timeslot
Please note that the List
may be the result of aggregate
or any other operation and not necessarily be created as one piece.
val diffFromMax = data.map(item => (item._1, findDiffFromMax(item._2)))
def findDiffFromMax(data: List[Integer]): List[Integer] = {
val maxItem = data.max
data.map(item => (maxItem - item))
}
The thing is that is the List
is distributed calculating the maxItem
may cause a lot of network traffic. This can be handles with an RDD of the following type:
RDD[(Long, Integer /*Max Item*/,List(Integer))]
Where the max item is calculated at driver.
So the question (actually 2 questions) are:
((Long, Integer), Double)
?Upvotes: 0
Views: 57
Reputation: 13001
The short answer is yes, your list would be located in a single worker.
Your tuple is a single record in the RDD. A single record is ALWAYS on a single partition (which would be on a single worker). When you do your findDiffFromMax you are running it on the target worker (so the function is serialized to all the workers to run).
The thing you should note is that when you generate a tuple of (k,v) in general this means a key value pair so you can do key based operations on the RDD. The order ( (Long, (Strong, Double)) vs. ((Long, String), Double) or any other way) doesn't really matter as it is all a single record. The only thing that would matter is which is the key in order to do key operations so the question would be the logic of your calculation
Upvotes: 3