MarkNS

Reputation: 4021

Spark - behaviour of first() operation

I'm trying to understand the jobs that get created by Spark for simple first() vs collect() operations.

Given the code:

myRDD = spark.sparkContext.parallelize(['A', 'B', 'C'])

def func(d):
    return d + '-foo'

myRDD = myRDD.map(func)

My RDD is split across 16 partitions:

print(myRDD.toDebugString())
(16) PythonRDD[24] at RDD at PythonRDD.scala:48 []
 |   ParallelCollectionRDD[23] at parallelize at PythonRDD.scala:475 []

If I call:

myRDD.collect()

I get 1 job with 16 tasks created. I assume this is one task per partition.

However, if I call:

myRDD.first()

I get 3 jobs, with 1, 4, and 11 tasks created. Why have 3 jobs been created?

I'm running spark-2.0.1 with a single 16-core executor, provisioned by Mesos.

Upvotes: 3

Views: 1328

Answers (1)

Mariusz

Reputation: 13926

This is actually quite smart behaviour on Spark's part. Your map() is a transformation (it is lazily evaluated), while first() and collect() are both actions (terminal operations). Transformations are only applied to the data at the moment you call an action.
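
As a minimal illustration of that laziness (assuming an existing SparkSession named spark, as in your snippet), the map() call below returns immediately without starting any job; work only happens once an action is invoked:

lazy_rdd = spark.sparkContext.parallelize(['A', 'B', 'C']).map(lambda d: d + '-foo')

# No job has run yet - map() only recorded the transformation.
print(lazy_rdd.collect())  # triggers one job (one task per partition)
print(lazy_rdd.first())    # triggers the incremental jobs described below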

When you call first(), Spark tries to do as little work as possible. It first launches a job over a single partition. If that produces no result, it launches a second job over 4 times as many partitions. If there is still nothing, it again multiplies the number of partitions scanned so far by 4 (5 * 4 = 20) and launches another job, capped at the partitions that remain, to try to get a result.

In your case only 11 untouched partitions remain for that third job (16 - 1 - 4 = 11), which is why you see jobs with 1, 4 and 11 tasks. With more data in the RDD, or fewer partitions, Spark would likely find the first() result sooner and launch fewer jobs.
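
To make the scanning behaviour concrete, here is a rough sketch (plain Python, not Spark's actual implementation) of how take(1), and therefore first(), decides how many partitions each successive job scans, assuming every earlier job came up empty, as happens in your case; the scale-up factor of 4 matches Spark's default, and the helper name partitions_per_job is purely illustrative:

def partitions_per_job(total_partitions, scale_up_factor=4):
    # Returns how many partitions each successive take(1) job scans,
    # assuming earlier jobs found no elements. The real loop also stops
    # as soon as enough elements have been collected.
    jobs = []
    parts_scanned = 0
    while parts_scanned < total_partitions:
        if parts_scanned == 0:
            num_parts_to_try = 1  # first job: a single partition
        else:
            num_parts_to_try = parts_scanned * scale_up_factor  # nothing found yet: scale up
        num_parts_to_try = min(num_parts_to_try, total_partitions - parts_scanned)
        jobs.append(num_parts_to_try)
        parts_scanned += num_parts_to_try
    return jobs

print(partitions_per_job(16))  # [1, 4, 11] - matching the 1, 4 and 11 tasks you observed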

Upvotes: 2
