Reputation: 18162
Spark has a zip() function to combine two RDDs. It also has functions to split them apart again: keys() and values(). But to my surprise, if you ask for just the keys(), both RDDs are fully computed, even though the values weren't necessary for the result.
In this example, I create an RDD of (key, value) pairs, but then I only ask for the keys. Why are the values computed anyway? Does Spark make no attempt to simplify its internal DAG in such cases?
In [1]: def process_value(val):
...: print "Processing {}".format(val)
...: return 2*val
...:
In [2]: k = sc.parallelize(['a','b','c'])
In [3]: v = sc.parallelize([1,2,3]).map(process_value)
In [4]: zipped = k.zip(v)
In [5]: zipped.keys().collect()
Processing 1
Processing 2
Processing 3
Out[5]: ['a', 'b', 'c']
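For comparison (a quick check, continuing the same session), collecting the original key RDD by itself triggers no process_value calls, which is what makes the eager computation above surprising:
In [6]: k.collect()
Out[6]: ['a', 'b', 'c']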
Upvotes: 0
Views: 59
Reputation: 2216
If you look at the source (at least as of 2.0), keys() is simply implemented as
rdd.map(_._1)
i.e. it returns the first element of each tuple. That means each tuple must be fully instantiated, and building the tuple forces both sides of the zip, including the mapped values, to be computed.
This might have worked if zip returned RDD[Seq[K, V]] or some other lazy data structure, but a tuple is not a lazy data structure.
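As an illustration (a PySpark sketch of the same idea, not the actual library source), keys() on the zipped RDD behaves like a map over the tuples, so the values are still produced; if the values are never needed, the simplest workaround is to use the original key RDD directly:

# Roughly what keys() does: map over each (key, value) tuple, which
# means every tuple (and therefore every mapped value) is materialized.
zipped.map(lambda kv: kv[0]).collect()   # still prints "Processing ..."

# If the values aren't needed, skip the zip and use the key RDD as-is.
k.collect()                              # no "Processing ..." output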
In short: no.
Upvotes: 2