Reputation: 18162
Spark has a zip() function to combine two RDDs. It also has functions to split them apart again: keys() and values(). But to my surprise, if you ask for just the keys(), both RDDs are fully computed, even though the values weren't necessary for the result.
In this example, I create an RDD of (key, value) pairs, but then I only ask for the keys. Why are the values computed anyway? Does Spark make no attempt to simplify its internal DAG in such cases?
In [1]: def process_value(val):
...: print "Processing {}".format(val)
...: return 2*val
...:
In [2]: k = sc.parallelize(['a','b','c'])
In [3]: v = sc.parallelize([1,2,3]).map(process_value)
In [4]: zipped = k.zip(v)
In [5]: zipped.keys().collect()
Processing 1
Processing 2
Processing 3
Out[5]: ['a', 'b', 'c']
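For comparison (a quick check, continuing the same session), collecting the original key RDD by itself triggers no process_value calls, which is what makes the eager computation above surprising:
In [6]: k.collect()
Out[6]: ['a', 'b', 'c']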
Upvotes: 0
Views: 59
Reputation: 2216
If you look at the source (at least as of 2.0), keys() is simply implemented as
rdd.map(_._1)
i.e. it returns the first element of each tuple. That means each tuple must be fully instantiated, and building the tuple forces both sides of the zip, including the mapped values, to be computed.
This might have worked if zip returned RDD[Seq[K, V]] or some other lazy data structure, but a tuple is not a lazy data structure.
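As an illustration (a PySpark sketch of the same idea, not the actual library source), keys() on the zipped RDD behaves like a map over the tuples, so the values are still produced; if the values are never needed, the simplest workaround is to use the original key RDD directly:

# Roughly what keys() does: map over each (key, value) tuple, which
# means every tuple (and therefore every mapped value) is materialized.
zipped.map(lambda kv: kv[0]).collect()   # still prints "Processing ..."

# If the values aren't needed, skip the zip and use the key RDD as-is.
k.collect()                              # no "Processing ..." output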
In short: no.
Upvotes: 2