Parallelization level of tupled RDD data

Question

Suppose I have an RDD with the following type:

RDD[(Long, List[Integer])]

Can I assume that the entire list is located on the same worker? I want to know whether certain operations are acceptable at the RDD level or should be calculated on the driver. For instance:

val data: RDD[(Long, List(Integer))] = someFunction() //creates list for each timeslot

Please note that the List may be the result of an aggregate or any other operation and is not necessarily created in one piece.

val diffFromMax = data.map(item => (item._1, findDiffFromMax(item._2)))

def findDiffFromMax(data: List[Integer]): List[Integer] = {
  val maxItem = data.max
  data.map(item => (maxItem - item))
}

The thing is that if the List is distributed, calculating maxItem may cause a lot of network traffic. This could be handled with an RDD of the following type:

RDD[(Long, Integer /*Max Item*/, List[Integer])]

where the max item is calculated on the driver.

So the questions (there are actually two) are:

1. At what level of the RDD data, if any, can I assume that the data is located on one worker? (Answers with references to documentation or personal evaluations would be great.) What happens in the case of a tuple inside a tuple, such as ((Long, Integer), Double)?
2. What is the common practice for designing algorithms with tuples? Should I always treat the data as if it may be spread across different workers? Should I always break it down to the minimal granularity in the first tuple field? For example, where there is data (Double) for a user (String) in a timeslot (Long), should the record be (Long, (String, Double)), ((Long, String), Double), or (String, (Long, Double))? Or are none of these optimal, and are matrices better?

Assaf Mendelson · Accepted Answer

The short answer is yes: your list will be located on a single worker.

Your tuple is a single record in the RDD. A single record is ALWAYS in a single partition (which lives on a single worker), so a record is never split across workers. When you call findDiffFromMax inside map, it runs on the worker that holds the record (the function is serialized and shipped to the workers), so computing the max of the list involves no network traffic.
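To make this concrete, here is a minimal sketch that runs the per-record calculation and tags each result with the index of the partition it was computed in. The object name, the sample data, the local[2] master and the use of Int instead of Integer are all assumptions made for illustration; they are not from the question.

```scala
import org.apache.spark.sql.SparkSession

object DiffFromMaxSketch {
  // Pure per-record function. It only sees the list of one record, so it
  // runs entirely inside the task that processes that record's partition.
  def findDiffFromMax(values: List[Int]): List[Int] =
    if (values.isEmpty) Nil
    else {
      val maxItem = values.max
      values.map(maxItem - _)
    }

  def main(args: Array[String]): Unit = {
    // Assumption: a local two-thread master, just for experimenting.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("diff-from-max-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical sample data: one list per timeslot, spread over 2 partitions.
    val data = sc.parallelize(
      Seq(1L -> List(3, 9, 4), 2L -> List(7, 2), 3L -> List(5)),
      numSlices = 2)

    // mapValues keeps the key and applies the function to each record's list.
    val diffFromMax = data.mapValues(findDiffFromMax)

    // Tag every record with the index of the partition that holds it.
    // Each (timeslot, list) pair shows up under exactly one index.
    val located = diffFromMax.mapPartitionsWithIndex { (idx, records) =>
      records.map { case (slot, diffs) => (idx, slot, diffs) }
    }

    located.collect().foreach(println)
    spark.stop()
  }
}
```

Which partition index a given timeslot ends up in depends on how the data is partitioned, but each list always arrives whole.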

The thing you should note is that a tuple of the form (k, v) is generally treated as a key-value pair, which lets you run key-based operations on the RDD. The nesting order ((Long, (String, Double)) vs. ((Long, String), Double) or any other arrangement) doesn't really matter for locality, because in every case it is a single record. The only thing that matters is which part is the key, since that determines what the key-based operations group by. So the right layout depends on the logic of your calculation.
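As an illustration of letting the calculation pick the key, the sketch below keys the same readings in two ways. The Reading case class, the helper names, the sample values and the local master are hypothetical and are not taken from the question.

```scala
import org.apache.spark.sql.SparkSession

object KeyChoiceSketch {
  // Hypothetical record: a value for a user in a timeslot.
  final case class Reading(timeslot: Long, user: String, value: Double)

  // Key by user when the calculation is "per user".
  def byUser(r: Reading): (String, Double) = (r.user, r.value)

  // Key by (timeslot, user) when the calculation is "per user per timeslot".
  def bySlotAndUser(r: Reading): ((Long, String), Double) =
    ((r.timeslot, r.user), r.value)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just for experimenting.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("key-choice-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    val readings = sc.parallelize(Seq(
      Reading(1L, "alice", 2.0),
      Reading(1L, "alice", 5.0),
      Reading(2L, "alice", 1.0),
      Reading(1L, "bob", 4.0)))

    // Total per user: the key is the user, timeslots are merged.
    val totalPerUser = readings.map(byUser).reduceByKey(_ + _)

    // Maximum per (timeslot, user): the key is the composite tuple.
    val maxPerSlotAndUser =
      readings.map(bySlotAndUser).reduceByKey((a, b) => math.max(a, b))

    totalPerUser.collect().foreach(println)
    maxPerSlotAndUser.collect().foreach(println)
    spark.stop()
  }
}
```

Both layouts carry the same data per record; only the choice of key changes what reduceByKey combines.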
